Monitoring batch processing with Nagios

2 years ago

雅, 悟

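TL;DR

I created a Debian instance on GCE for batch jobs. As a Nagios beginner, I wanted to use Nagios to run a Python script periodically, monitor its result, and monitor the state of the instance itself, all without extra cost.

However, I could not find any recent Japanese article that covers everything from installation to operation (perhaps my search skills are lacking), so I am summarizing it here as a memo.

What is Nagios?

A monitoring tool that watches resource status such as CPU, memory, and disk, as well as process and service status. It can also evaluate strings written to log files, detect when a failure occurs or is resolved, and send failure notifications.

There are two editions: Nagios XI (paid) and Nagios Core (open source).
For the differences, see this link.

In short, Nagios Core provides the minimum features such as resource monitoring, log monitoring, and failure notification, but visualization such as graphs requires installing additional plugins. Nagios XI bundles those features and adds other user-friendly ones.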

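What I mainly want to do this time is:

1. Run the batch periodically and monitor its result.
2. Monitor the instance and get notified before it becomes unable to run the batch.

Since there are only these two items, and they will not need frequent additions, fixes, or changes, a rich GUI is unnecessary, so I use the free edition here.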

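Supplement

There are other OSS monitoring tools to choose from, such as Zabbix and Prometheus. Here is why I did not choose them.
In short, I wanted to keep things simple this time, and the convenient features that Nagios lacks would have been overkill for my needs.

Common reason

In Nagios, monitoring settings are written in text files, so they can be managed with Git and the learning cost is low (although some searching skill is required).

Reason specific to Zabbix

Nagios can retrieve multiple values in a single check, whereas a Zabbix item retrieves only one value per collection.

So to retrieve multiple values at once in Zabbix as Nagios does, you have to combine it with something like zabbix_sender: prepare a script that collects several values in one run and sends them to the Zabbix server with the zabbix_sender command.

This time I want to run several batch files (and there may be further needs later), so the extra Zabbix setup and management would add cost.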
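As a rough illustration of "multiple values in a single check": a Nagios plugin can append performance data after a `|` on its output line, as space-separated label=value pairs. The sketch below only builds such a line; the function name and the metrics (`rows`, `elapsed`) are made up for this example.

```python
def format_check_output(summary, metrics):
    """Build one plugin output line: readable text, then '|' and perfdata."""
    # Performance data is a space-separated list of label=value pairs.
    perfdata = " ".join(f"{label}={value}" for label, value in metrics.items())
    return f"{summary} | {perfdata}"


# Hypothetical metrics collected during one batch run.
line = format_check_output("OK - batch finished", {"rows": 120, "elapsed": "3.2s"})
print(line)
```

One check run therefore reports both values, with no separate sender script.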

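Reason specific to Prometheus

Unlike Nagios and Zabbix, Prometheus focuses on monitoring cloud and container environments and is designed around a philosophy different from all-in-one. Features such as notification require separate components, so configuration and management are more complex and costly than with Nagios.

Cron

I could simply run the script periodically with Cron and send error notifications from within the script, but monitoring whether the instance is alive would be a hassle, so I ruled this out. (On-premises is a different story, but on GCE, as in this case, Stackdriver monitors instance liveness automatically, so Cron would also work. The question is how much you want to cover with a single setup.)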

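Installing Nagios (version 4.4.4)

The preamble got long, but let's move on to installing Nagios Core. I start from an instance (Debian) already running on GCE. (For other operating systems, see the installation help listed in the references.)

(As of 2019/08/05, guides that install via apt-get install nagios3 no longer work.)

Install the required packages.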

$ sudo apt-get update
$ sudo apt-get install -y autoconf gcc libc6 make wget unzip apache2 apache2-utils php libgd-dev

Create a download directory and download the source.

$ mkdir downloads
$ cd downloads
$ wget -O nagioscore.tar.gz https://github.com/NagiosEnterprises/nagioscore/archive/nagios-4.4.4.tar.gz
$ tar xzf nagioscore.tar.gz

Compile the source.

$ cd nagioscore-nagios-4.4.4
$ ./configure --with-httpd-conf=/etc/apache2/sites-enabled
$ make all

Add the Nagios user and group.

$ sudo make install-groups-users
$ sudo usermod -a -G nagios www-data

Install the binaries.

$ sudo make install

Install the Nagios service and daemon files.

$ sudo make install-daemoninit

Install command mode.

$ sudo make install-commandmode

Install the sample Nagios configuration files.

These files are only samples for the initial run, so change them later to suit what you want to do.

$ sudo make install-config

Install the Apache configuration files.

$ sudo make install-webconf
$ sudo a2enmod rewrite
$ sudo a2enmod cgi

Firewall settings

These two commands will prompt for confirmation; answering "Yes" is basically fine.

$ sudo iptables -I INPUT -p tcp --destination-port 80 -j ACCEPT
$ sudo apt-get install -y iptables-persistent

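Creating the Nagios admin user

Set any password you like.
When adding more users later, drop the -c from this command. Otherwise the password file is recreated and the admin user created here is overwritten and lost, so be careful.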

$ sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Starting the Apache web server

$ sudo systemctl restart apache2.service

Starting the Nagios service and daemon

$ sudo systemctl start nagios.service

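At this point you can see the Nagios web interface. Access http://[server IP]/nagios and confirm that it has started correctly. If the Nagios top page appears, the installation succeeded.

However, Nagios Plugins are not installed yet, so at this stage the checks show a critical error: "(No output on stdout) stderr: execvp(/usr/local/nagios/libexec/check_load, …) failed. errno is 2: No such file or directory".

Installing Nagios Plugins (version 2.2.1)

To resolve the error above, install the Nagios plugins step by step. This covers the basics; if you need anything else, see this link.

Install the required packages.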

$ sudo apt-get install -y autoconf gcc libc6 libmcrypt-dev make libssl-dev wget bc gawk dc build-essential snmp libnet-snmp-perl gettext

Download the source.

$ cd downloads
$ wget --no-check-certificate -O nagios-plugins.tar.gz https://github.com/nagios-plugins/nagios-plugins/archive/release-2.2.1.tar.gz
$ tar zxf nagios-plugins.tar.gz

Compile the source.

$ cd nagios-plugins-release-2.2.1/
$ ./tools/setup
$ ./configure
$ sudo make
$ sudo make install

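Access http://[server external IP]/nagios again, click "Hosts" or "Services" under "Current Status", and then click "localhost".
Then click "Re-schedule the next check of this host" under "Host Commands"; running it refreshes the status.

Nagios is finally usable.

1. Running and monitoring the batch file periodically

From here, I set up what I want to do in order. First is item 1, periodic execution and monitoring.
By writing a few settings in the Nagios configuration files, you can run a specified file periodically, capture its standard output, and raise alerts based on the result.

First, place (copy) the batch file used this time (hello.py) in /usr/local/nagios/libexec/.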

$ sudo cp hello.py /usr/local/nagios/libexec/hello.py

Check the permissions of the copied file and grant execute permission if needed.

$ sudo chmod 755 /usr/local/nagios/libexec/hello.py

Next, add the command to run to /usr/local/nagios/etc/objects/commands.cfg.

define command {
    command_name    hello
    command_line    $USER1$/hello.py
}

Then add the corresponding execution settings to /usr/local/nagios/etc/objects/templates.cfg, as follows.

define service {

    name                            batch-service           ; The 'name' of this service template
    active_checks_enabled           1                       ; Active service checks are enabled
    passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
    parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1                       ; We should obsess over this service (if necessary)
    check_freshness                 0                       ; Default is to NOT check service 'freshness'
    notifications_enabled           1                       ; Service notifications are enabled
    event_handler_enabled           1                       ; Service event handler is enabled
    flap_detection_enabled          1                       ; Flap detection is enabled
    process_perf_data               1                       ; Process performance data
    retain_status_information       1                       ; Retain status information across program restarts
    retain_nonstatus_information    1                       ; Retain non-status information across program restarts
    is_volatile                     0                       ; The service is not volatile
    check_period                    batch_time              ; The service can be checked at 9:00(JST) of the day
    max_check_attempts              1                       ; Re-check the service up to 1 times in order to determine its final (hard) state
    check_interval                  60                      ; Check the service every 60 minutes under normal conditions
    retry_interval                  45                      ; Re-check the service every 45 minutes until a hard state can be determined
    contact_groups                  admins                  ; Notifications get sent out to everyone in the 'admins' group
    notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           60                      ; Re-notify about service problems every hour
    notification_period             24x7                    ; Notifications can be sent out at any time
    register                        0                       ; DON'T REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

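For clarity, I wrote out every setting this time. If much of it duplicates an existing definition, you can instead inherit from that definition and write only the changed parts, as follows.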

define service {

    name                            batch-service           ; The name of this service template
    use                             generic-service         ; Inherit default values from the generic-service definition
    check_period                    batch_time              ; The service can be checked at 9:00(JST) of the day
    max_check_attempts              1                       ; Re-check the service up to 1 times in order to determine its final (hard) state
    check_interval                  60                      ; Check the service every 60 minutes under normal conditions
    retry_interval                  45                      ; Re-check the service every 45 minutes until a hard state can be determined
}
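To picture what `use` does, here is a simplified model (not how Nagios is actually implemented): the template's own directives win, and anything it does not set falls back to the parent. The generic-service values below are illustrative placeholders, not the real sample defaults.

```python
from collections import ChainMap

# Placeholder parent values; check your own templates.cfg for the real ones.
generic_service = {
    "check_period": "24x7",
    "max_check_attempts": 3,
    "check_interval": 10,
    "retry_interval": 2,
    "notification_interval": 60,
}

# Only the directives that batch-service overrides.
batch_service = {
    "check_period": "batch_time",
    "max_check_attempts": 1,
    "check_interval": 60,
    "retry_interval": 45,
}

# ChainMap searches the first mapping, then falls back to the next one.
effective = dict(ChainMap(batch_service, generic_service))
print(effective["check_period"], effective["notification_interval"])
```

The overridden directives take the batch-service values, while notification_interval is still inherited.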

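In /usr/local/nagios/etc/objects/timeperiods.cfg, I define batch_time, the time period used as check_period (every day 9:00-9:30 JST; the entries read 00:00-00:30 on the assumption that the instance clock is UTC). So that the check runs only once a day, and only once even when an error is detected, I set check_interval to 60 minutes and retry_interval to 45 minutes, both longer than the 30-minute window.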

# run batch file everyday once

define timeperiod {

    name                    batch_time
    timeperiod_name         batch_time
    alias                   run batch at 9:00(JST) everyday

    sunday                  00:00-00:30
    monday                  00:00-00:30
    tuesday                 00:00-00:30
    wednesday               00:00-00:30
    thursday                00:00-00:30
    friday                  00:00-00:30
    saturday                00:00-00:30
}
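The 00:00-00:30 entries can be derived from the intended 9:00-9:30 JST window. This is a small sketch assuming the server clock is UTC (which I believe is the GCE default, but check with `date` on your instance); `jst_window_to_utc` is a helper name made up here.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def jst_window_to_utc(start_hhmm, end_hhmm):
    """Convert an HH:MM-HH:MM window given in JST to the same window in UTC."""

    def convert(hhmm):
        # The date part is arbitrary; only the time of day is used.
        local = datetime.strptime(hhmm, "%H:%M").replace(tzinfo=JST)
        return local.astimezone(timezone.utc).strftime("%H:%M")

    return f"{convert(start_hhmm)}-{convert(end_hhmm)}"


print(jst_window_to_utc("09:00", "09:30"))
```

Note that a JST time earlier than 09:00 falls on the previous day in UTC, so the weekday entries would shift as well; this helper ignores that.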

Then, using the service template defined above, define the command (service) to run in /usr/local/nagios/etc/objects/localhost.cfg as follows.

# Define a service to run batch file on the local machine.

define service {
    use                     batch-service ; Name of service template to use
    host_name               localhost
    service_description     run the hello
    check_command           hello
}

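The execution environment is now set up. Let's verify that the configuration files contain no errors (problems such as undefined references are detected here).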

$ sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 4.4.4
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2019-07-29
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
    Checked 9 services.
    Checked 1 hosts.
    Checked 1 host groups.
    Checked 0 service groups.
    Checked 1 contacts.
    Checked 1 contact groups.
    Checked 25 commands.
    Checked 6 time periods.
    Checked 0 host escalations.
    Checked 0 service escalations.
Checking for circular paths...
    Checked 1 hosts
    Checked 0 service dependencies
    Checked 0 host dependencies
    Checked 6 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
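If you want to script this verification, for example before restarting automatically, the summary lines can be read out of the output. A minimal sketch, assuming the output keeps the "Total Warnings" / "Total Errors" format shown above; `parse_preflight` is a name I made up.

```python
import re


def parse_preflight(output):
    """Return (warnings, errors) from the text printed by `nagios -v`."""
    warnings = re.search(r"Total Warnings:\s*(\d+)", output)
    errors = re.search(r"Total Errors:\s*(\d+)", output)
    if warnings is None or errors is None:
        raise ValueError("pre-flight summary not found in output")
    return int(warnings.group(1)), int(errors.group(1))


# Sample text copied from the summary lines above.
sample = "Total Warnings: 0\nTotal Errors:   0\n"
print(parse_preflight(sample))
```

In practice the text could come from subprocess.run([...], capture_output=True, text=True).stdout, restarting only when the error count is 0.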

After confirming there are no problems, restart Nagios to apply the settings.

$ sudo systemctl restart nagios.service

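In summary, you only need to update four files and restart Nagios. Easy! (Although it took me a while to find the right way.)

Supplement

By the way, when running a Python script on a Linux-family OS, note that the file needs a shebang such as #!/usr/bin/env python3 at the top (I got caught by this).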
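A quick way to catch this before Nagios does is to check the file yourself. The helper below is just a sketch (`looks_runnable` is a made-up name); it only checks that the first line starts with `#!` and that the current user may execute the file.

```python
import os


def looks_runnable(path):
    """True if the file starts with a shebang and is executable by this user."""
    with open(path, "rb") as f:
        first_line = f.readline()
    return first_line.startswith(b"#!") and os.access(path, os.X_OK)


# Example (path taken from this article):
# looks_runnable("/usr/local/nagios/libexec/hello.py")
```

Since Nagios runs checks as the nagios user, running this as that user gives the most meaningful answer.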

Also, if the processing takes more than one minute, you need to change service_check_timeout in /usr/local/nagios/etc/nagios.cfg. Its default is 60 seconds, so adjust it to the batch's execution time.

# TIMEOUT VALUES
# These options control how much time Nagios will allow various
# types of commands to execute before killing them off.  Options
# are available for controlling maximum time allotted for
# service checks, host checks, event handlers, notifications, the
# ocsp command, and performance data commands.  All values are in
# seconds.

service_check_timeout=600
host_check_timeout=30
event_handler_timeout=30
notification_timeout=300
ocsp_timeout=5
ochp_timeout=5
perfdata_timeout=5

After changing this, don't forget to restart with $ sudo systemctl restart nagios.service.
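To double-check which timeout is actually configured, the key=value lines of nagios.cfg are easy to read from a script. A minimal sketch: `read_timeout` is a hypothetical helper, and the fallback of 60 is the default mentioned above.

```python
def read_timeout(cfg_text, key="service_check_timeout", default=60):
    """Return the integer value of `key` from nagios.cfg-style text."""
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return int(value.strip())
    return default


sample_cfg = "# TIMEOUT VALUES\nservice_check_timeout=600\nhost_check_timeout=30\n"
print(read_timeout(sample_cfg))
```

Comparing this value with the batch's measured run time shows whether Nagios would kill the check before it finishes.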

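Also, note the following when reporting errors.
First, make the program's exit status follow the Nagios plugin convention. Nagios monitoring has four states, OK, WARNING, CRITICAL, and UNKNOWN, which correspond to plugin exit statuses 0, 1, 2, and 3.
So the Python script's exit status must match these.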

Status label    Exit status
OK              0
WARNING         1
CRITICAL        2
UNKNOWN         3
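In Python this mapping might look like the sketch below. It is not the actual hello.py; `hello` is a stand-in for the real batch work, and treating every exception as CRITICAL is just one possible policy.

```python
from enum import IntEnum


class NagiosState(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def run_check(batch):
    """Run batch() and map its outcome to (state, message)."""
    try:
        message = batch()
    except Exception as exc:  # assumption: any failure counts as CRITICAL
        return NagiosState.CRITICAL, f"batch failed: {exc}"
    return NagiosState.OK, message


def hello():
    """Hypothetical stand-in for the real batch work."""
    return "hello finished"


def main():
    state, message = run_check(hello)
    print(f"{state.name} - {message}")
    return int(state)
```

The real script would end with sys.exit(main()) (after import sys), so that Nagios sees the return value as the exit status.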

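Next, write error messages to standard output.
In the service list of the Nagios web UI (summary views such as Services), the "Status Information" column shows the first line the plugin wrote to standard output. Opening the detail page of that item shows the second and subsequent lines.
So, to convey information that goes with the check state, put the key point on the first line and further details from the second line on.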
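One way to keep that layout consistent is a small formatting helper. This is only a sketch; the function name and the sample messages are invented for illustration.

```python
def format_output(state_label, summary, details=()):
    """First line: state and summary. Following lines: details."""
    # Collapse whitespace so the summary cannot spill onto a second line.
    first_line = f"{state_label} - {' '.join(summary.split())}"
    return "\n".join([first_line, *details])


text = format_output(
    "CRITICAL",
    "2 of 10 files failed",
    ["file_a.csv: timeout", "file_b.csv: parse error"],
)
print(text)
```

Here "CRITICAL - 2 of 10 files failed" is what the list view would show, and the per-file lines appear on the detail page.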

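2. Sending Nagios alerts to Slack

This is easy: Slack has an official Nagios integration, so just use it. (Note that its instructions are outdated, so some parts may not work as written.)

First, open the Slack administration page. Click the button to add an app, then search for Nagios and add it.

Adding it displays the setup steps, so follow them one by one. (The file layout differs from this setup, so following them exactly as written will not work.)

First, install the Nagios plugin for Slack, move it into the execution environment, and change its permissions.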

$ cd downloads
$ sudo apt-get install libwww-perl libcrypt-ssleay-perl
$ wget https://raw.github.com/tinyspeck/services-examples/master/nagios.pl
$ sudo cp nagios.pl /usr/local/nagios/libexec/slack_nagios.pl
$ sudo chmod 755 /usr/local/nagios/libexec/slack_nagios.pl

Edit /usr/local/nagios/libexec/slack_nagios.pl and set the $opt_domain and $opt_token variables for the Slack workspace to notify.

$opt_domain = "YourDomain.slack.com"; # team domain
$opt_token = "NagiosServiceToken"; # token from the Nagios service page

Post a test message to confirm that it goes through.

$ /usr/local/nagios/libexec/slack_nagios.pl -field slack_channel="#nagios" -field HOSTALIAS="HOSTNAME" -field SERVICEDESC="SERVICEDESC" -field SERVICESTATE="SERVICESTATE" -field SERVICEOUTPUT="SERVICEOUTPUT" -field NOTIFICATIONTYPE="NOTIFICATIONTYPE"
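The same test call can be assembled from a script, which avoids quoting mistakes with the `#` in the channel name. A minimal sketch: `slack_command` is a helper name I made up, and the field names are the ones used in the command above.

```python
import shlex


def slack_command(script, fields):
    """Build the argument list: script followed by -field NAME=VALUE pairs."""
    args = [script]
    for name, value in fields.items():
        args += ["-field", f"{name}={value}"]
    return args


cmd = slack_command(
    "/usr/local/nagios/libexec/slack_nagios.pl",
    {"slack_channel": "#nagios", "HOSTALIAS": "HOSTNAME"},
)
# shlex.join quotes each argument for safe copy and paste into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run it without going through a shell at all.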

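If the test message shows up in the Slack channel, you are good for now!

Next, add the Slack settings to each Nagios configuration file. First, add them to the contacts configuration file /usr/local/nagios/etc/objects/contacts.cfg.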

define contact {
      contact_name                             slack
      alias                                    Slack
      service_notification_period              24x7
      host_notification_period                 24x7
      service_notification_options             w,u,c,r
      host_notification_options                d,r
      service_notification_commands            notify-service-by-slack
      host_notification_commands               notify-host-by-slack
}

define contactgroup {
  contactgroup_name admins
  alias             Nagios Administrators
  members           nagiosadmin, slack
}

Next, add the notification commands to the command configuration file /usr/local/nagios/etc/objects/commands.cfg.

define command {
    command_name        notify-service-by-slack
    command_line        /usr/local/nagios/libexec/slack_nagios.pl -field slack_channel="#nagios" -field HOSTALIAS="$HOSTNAME$" -field SERVICEDESC="$SERVICEDESC$" -field SERVICESTATE="$SERVICESTATE$" -field SERVICEOUTPUT="$SERVICEOUTPUT$" -field NOTIFICATIONTYPE="$NOTIFICATIONTYPE$"
}

define command {
    command_name        notify-host-by-slack
    command_line        /usr/local/nagios/libexec/slack_nagios.pl -field slack_channel="#nagios" -field HOSTALIAS="$HOSTNAME$" -field HOSTSTATE="$HOSTSTATE$" -field HOSTOUTPUT="$HOSTOUTPUT$" -field NOTIFICATIONTYPE="$NOTIFICATIONTYPE$"
}

Verify that the settings contain no errors.

$ sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no problems, restart Nagios.

$ sudo systemctl restart nagios.service

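Now try it for real and send an alert. The notification appears in Slack in a different color depending on the notification type.

As for item 2, the current settings mostly cover it, so I will stop here for now! (That was tiring...)

I am still a beginner, so there may be mistakes or wrong assumptions made while figuring this out. Comments and corrections are welcome m(_ _)m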

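References

Nagios Core official site

Nagios Core/Plugins installation help

Explanation of the variables in each Nagios configuration file

Time period definitions in Nagios

How to resolve service check timeouts in Nagios

Nagios and Slack integration

Comparison with Zabbix

Comparison of Zabbix and Prometheus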

Addendum

version:

Latest stable release 2019-07-29 Nagios Core(ver.4.4.4)
Nagios-Plugin 2.2.1 is the latest version (as of 2019-08-05)

Maybe I should read a book. But the books probably cover version 3...