A record of how long it took to send 3.46 million access log entries to Elasticsearch via Logstash

Introduction

This is really just setup for playing with Kibana later: while loading the data into Elasticsearch via Logstash, I got curious about how much data could be ingested, and how fast. Spoiler: it finished in about 23 minutes.

Environment

Docker is installed on a Mac, with Logstash and Elasticsearch running as containers on top of it.

    • PC

MacBook Air (13-inch, Early 2015)
macOS High Sierra version 10.13.3
Processor: 1.6 GHz Intel Core i5
Memory: 8 GB 1600 MHz DDR3

Docker Version 18.03.0-ce-mac59 (23608)
Elasticsearch 6.2.3
Logstash 6.2.3

Install Elasticsearch 6.2.3

The version I had been using locally was getting a bit old, so I grabbed the latest while I was at it.

$ docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.3

Before running a container from the pulled image, create a network so that Elasticsearch, Logstash, and Kibana can reach each other.

$ docker network create elasticnw

Attach containers to the created network by adding the --net option to docker run.

$ docker run -dti -h elasticsearch --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --net elasticnw docker.elastic.co/elasticsearch/elasticsearch:6.2.3 /bin/bash

Attaching to the Elasticsearch container drops you in as root by default, but Elasticsearch refuses to run as root, and typing the startup command by hand every time is a hassle. So, somewhat crudely, I wrote a simple startup script and dropped it under the elasticsearch bin directory.

#!/bin/bash

#--------------------------------------------------
# elasticsearch startup shell
#--------------------------------------------------

ELASTIC_HOME="/usr/share/elasticsearch"

su - elasticsearch -c "$ELASTIC_HOME/bin/elasticsearch" > "$ELASTIC_HOME/logs/elastic.log" &

Once the startup log settles down, check connectivity from the host OS.

$ curl -s http://localhost:9200
{
  "name" : "pBmvODk",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "7QL3pMI6T2SLyxc22KdYxg",
  "version" : {
    "number" : "6.2.3",
    "build_hash" : "c59ff00",
    "build_date" : "2018-03-13T10:06:29.741383Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Communication looks good. Next up, Logstash.

Install Logstash

Install Logstash via a container as well. The -v option mounts a host OS directory into the container, and the --net option attaches it to the same network as Elasticsearch. After starting the container, attach to it.

$ docker pull docker.elastic.co/logstash/logstash:6.2.3
$ docker run -dti -h logstash --name logstash -v /work/data/logstash:/data --net elasticnw docker.elastic.co/logstash/logstash:6.2.3 /bin/bash
$ docker attach logstash

The data this time is an Apache HTTP Server access log in common log format, so I created the following conf file and placed it in the config folder under the Logstash home directory. Why is the index named nasa? Because while hunting online for a large sample of log data, I found the historical access logs for NASA's WWW site from July and August 1995, and that is what I used. That's the whole story behind the name.

input {
  file {
    path => "/data/access_log_*"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  if [path] =~ "access" {
    mutate { replace => { "type" => "apache_access" } }
    grok {
      match => { "message" => "%{COMMONAPACHELOG}" }
    }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
    elasticsearch {
        hosts => [ "elasticsearch" ]
        index => "nasa"
    }
}
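For a concrete sense of what the %{COMMONAPACHELOG} grok pattern pulls out, here is a rough shell approximation applied to one line in Apache common log format. The sample line is typical of the NASA dataset, and the variable names only loosely mirror the fields grok produces; this is a sketch, not grok itself:

```shell
# A sample line in Apache common log format
line='199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Rough equivalents of the fields %{COMMONAPACHELOG} extracts
clientip=$(echo "$line" | awk '{print $1}')
timestamp=$(echo "$line" | sed 's/.*\[\(.*\)\].*/\1/')
verb=$(echo "$line" | awk -F'"' '{print $2}' | awk '{print $1}')
request=$(echo "$line" | awk -F'"' '{print $2}' | awk '{print $2}')
response=$(echo "$line" | awk '{print $(NF-1)}')
bytes=$(echo "$line" | awk '{print $NF}')

echo "clientip=$clientip verb=$verb request=$request response=$response bytes=$bytes"
```

The date filter in the conf then parses that bracketed timestamp (dd/MMM/yyyy:HH:mm:ss Z) into @timestamp, which is what lets Kibana place the 1995 events at their original time.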

Put the downloaded data into the data folder on the host OS.

$ mv ~/Downloads/access_log_*95 /work/data/logstash
$ cd /work/data/logstash
$ ls -l
-rw-r--r--@ 1 xxxx@xxxx  staff  167813770  4 16 17:18 access_log_Aug95
-rw-r--r--@ 1 xxxx@xxxx  staff  205242368  4 16 17:17 access_log_Jul95

Checking inside the container as well: the files are there, about 3.46 million lines in total. The file timestamps look slightly off because the container's clock is still on UTC.

bash-4.2$ ls -l /data
total 364316
-rw-r--r-- 1 logstash logstash 167813770 Apr 16 08:18 access_log_Aug95
-rw-r--r-- 1 logstash logstash 205242368 Apr 16 08:17 access_log_Jul95
bash-4.2$ wc -l access_log_*
  1569898 access_log_Aug95
  1891714 access_log_Jul95
  3461612 total

That completes the log preparation.

Prepare a simple measurement script

There is probably a better way, but I wrote a script that queries the document count of the Elasticsearch index once a second.

#!/bin/sh

ESURL="http://localhost:9200"

if [ $# -eq 0 ]
then
  echo
  echo "usage: $0 INDEX"
  echo
  exit 1
else
  INDEX_NAME=$1
fi

while true
do
  echo `date +[%Y/%m/%d" "%T]` `curl -s $ESURL/$INDEX_NAME/_count`
  sleep 1
done

exit 0

Don't forget the execute permission.

$ chmod 755 countesdocs.sh
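As an aside, one line of this script's output can be reduced to a bare document count with sed, which is handy when post-processing the log afterwards. A minimal sketch, assuming the JSON shape _count actually returns:

```shell
# One line as emitted by countesdocs.sh once the index exists
sample='[2018/04/17 11:12:52] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}'

# Pull out the value of the "count" field
count=$(echo "$sample" | sed 's/.*"count":\([0-9]*\).*/\1/')
echo "$count"
```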

Run the measurement

Run Logstash and the measurement script at the same time. Logstash runs inside its container; the measurement script runs on the host OS, with its output log followed via tail.

bash-4.2$ logstash -f /usr/share/logstash/config/accesslog.conf 
$ ./countesdocs.sh nasa > logstash_nasa.log &
$ tail -f logstash_nasa.log

Data started coming in about a minute after Logstash was launched. (Until the index is created, the _count queries return 404.)

[2018/04/17 10:50:27] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
[2018/04/17 10:50:28] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
[2018/04/17 10:50:29] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}

--- 中略 ---

[2018/04/17 10:51:29] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
[2018/04/17 10:51:30] {"count":0,"_shards":{"total":5,"successful":4,"skipped":0,"failed":0}}
[2018/04/17 10:51:31] {"count":668,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}

Going by the timestamps, ingestion started at 10:50:27 and finished about 23 minutes later, at 11:12:52. I didn't compute exact statistics, but that works out to roughly 2,000 to 4,000 documents per second. I also didn't monitor CPU or memory, but a Mac of this class apparently copes just fine.

[2018/04/17 11:12:50] {"count":3454294,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:51] {"count":3458432,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:52] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:53] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:54] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
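As a sanity check, a little shell arithmetic over the logged counts gives both an instantaneous and an average rate; the two adjacent samples and the start/end times are taken from the output above:

```shell
# Two adjacent samples from the measurement log
count_at_t1=3454294   # 11:12:50
count_at_t2=3458432   # 11:12:51

# Instantaneous rate: documents indexed during that one second
rate=$(( count_at_t2 - count_at_t1 ))
echo "docs/sec in that second: $rate"

# Overall average: 3,461,612 documents between 10:50:27 and 11:12:52,
# i.e. 1345 seconds of wall-clock time
elapsed=1345
avg=$(( 3461612 / elapsed ))
echo "average docs/sec: $avg"
```

An average of roughly 2,600 documents per second, consistent with the 2,000 to 4,000 range seen in individual seconds.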

With a proper server or cluster setup, performance should be considerably better, I'd think.

References

How to ship Apache logs to Elasticsearch with Logstash
Logstash configuration examples
Installing Elasticsearch with Docker
Search APIs
Publicly available access.log datasets
