Timing how long it takes to push 3.46 million access log lines into Elasticsearch via Logstash
Introduction
I was loading data into Elasticsearch via Logstash anyway, just so I could play with it in Kibana later, and out of curiosity I wanted to see how much data it could ingest and how fast. Spoiler: it finished in roughly 23 minutes.
Environment
Docker runs on a Mac, with Logstash and Elasticsearch each running in a container on top of it.
- PC: MacBook Air (13-inch, Early 2015)
- macOS High Sierra version 10.13.3
- Processor: 1.6 GHz Intel Core i5
- Memory: 8 GB 1600 MHz DDR3
- Docker Version 18.03.0-ce-mac59 (23608)
- Elasticsearch 6.2.3
- Logstash 6.2.3
Installing Elasticsearch 6.2.3
The image I had used before was getting a bit old, so I pulled the latest one available at the time.
$ docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.3
Before running a container from the pulled image, create a network so that Elasticsearch, Logstash, and Kibana can talk to each other.
$ docker network create elasticnw
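As an optional sanity check (not part of my original notes), docker network inspect shows the network's settings and, once containers are started, which ones are attached:
$ docker network inspect elasticnw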
Attach the container to the network we just created by adding the --net option when running it.
$ docker run -dti -h elasticsearch --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --net elasticnw docker.elastic.co/elasticsearch/elasticsearch:6.2.3 /bin/bash
Attaching to the Elasticsearch container drops you in as root by default, but Elasticsearch cannot be run as root, and typing the startup command by hand every time is a hassle. So, crude as it is, I wrote a simple startup script and placed it under elasticsearch/bin.
#!/bin/bash
#--------------------------------------------------
# elasticsearch startup shell
#--------------------------------------------------
ELASTIC_HOME="/usr/share/elasticsearch"
# run as the elasticsearch user (Elasticsearch refuses to start as root),
# redirecting output to a log file and backgrounding the process
su - elasticsearch -c "$ELASTIC_HOME/bin/elasticsearch" > "$ELASTIC_HOME/logs/elastic.log" &
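For completeness, here is roughly how the script gets into the container and gets launched; the filename startelastic.sh is just my own choice for illustration:
$ docker cp startelastic.sh elasticsearch:/usr/share/elasticsearch/bin/
$ docker attach elasticsearch
# inside the container, as root:
chmod +x /usr/share/elasticsearch/bin/startelastic.sh
/usr/share/elasticsearch/bin/startelastic.sh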
Once the startup log settles down, check connectivity from the host OS.
$ curl -s http://localhost:9200
{
  "name" : "pBmvODk",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "7QL3pMI6T2SLyxc22KdYxg",
  "version" : {
    "number" : "6.2.3",
    "build_hash" : "c59ff00",
    "build_date" : "2018-03-13T10:06:29.741383Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
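While at it, the _cat/health endpoint gives a quick view of cluster state from the host as well; on a single node, expect the status to be yellow once an index with replicas exists:
$ curl -s 'http://localhost:9200/_cat/health?v'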
Communication confirmed. Next up: Logstash.
Installing Logstash
Logstash is also installed as a container. The -v option mounts a host OS directory into the container, and the --net option attaches it to the same network as Elasticsearch. Once the container is up, attach to it.
$ docker pull docker.elastic.co/logstash/logstash:6.2.3
$ docker run -dti -h logstash --name logstash -v /work/data/logstash:/data --net elasticnw docker.elastic.co/logstash/logstash:6.2.3 /bin/bash
$ docker attach logstash
The data this time is an Apache HTTP Server access log in the common log format, so I created the conf file below and placed it in the config directory under the Logstash home directory. Why is the index named nasa? Because while hunting for a large sample of log data on the web, I came across the historical access logs that NASA published for its WWW site covering July and August 1995, and that is what I used.
input {
  file {
    path => "/data/access_log_*"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  if [path] =~ "access" {
    mutate { replace => { "type" => "apache_access" } }
    grok {
      match => { "message" => "%{COMMONAPACHELOG}" }
    }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    hosts => [ "elasticsearch" ]
    index => "nasa"
  }
}
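Before the real run, the conf file can be checked for syntax errors with Logstash's --config.test_and_exit flag (executed inside the container, pointing at the same path used later):
bash-4.2$ logstash -f /usr/share/logstash/config/accesslog.conf --config.test_and_exit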
Place the downloaded data in the data directory on the host OS.
$ mv ~/Downloads/access_log_*95 /work/data/logstash
$ cd /work/data/logstash
$ ls -l
-rw-r--r--@ 1 xxxx@xxxx staff 167813770 4 16 17:18 access_log_Aug95
-rw-r--r--@ 1 xxxx@xxxx staff 205242368 4 16 17:17 access_log_Jul95
Checking inside the container as well: the files are there. In total there are about 3.46 million lines. The file timestamps look slightly off because the container is still on UTC.
bash-4.2$ ls -l /data
total 364316
-rw-r--r-- 1 logstash logstash 167813770 Apr 16 08:18 access_log_Aug95
-rw-r--r-- 1 logstash logstash 205242368 Apr 16 08:17 access_log_Jul95
bash-4.2$ wc -l access_log_*
1569898 access_log_Aug95
1891714 access_log_Jul95
3461612 total
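For reference, the lines in this dataset are classic common-log-format entries, which is exactly what %{COMMONAPACHELOG} splits into client, identity, user, timestamp, request, status, and byte-count fields. The first line of the July file looks like this (quoted from memory of the NASA logs, so treat it as illustrative):
bash-4.2$ head -1 /data/access_log_Jul95
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245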
That completes the log preparation.
Preparing a simple measurement script
There may well be a better way, but I put together a script that queries the document count of an Elasticsearch index once per second.
#!/bin/sh
ESURL="http://localhost:9200"
# require the index name as the only argument
if [ $# -eq 0 ]
then
  echo
  echo "usage: $0 INDEX"
  echo
  exit 1
else
  INDEX_NAME=$1
fi
# print a timestamped document count once per second
while true
do
  echo `date +[%Y/%m/%d" "%T]` `curl -s $ESURL/$INDEX_NAME/_count`
  sleep 1
done
exit 0
Don't forget to make it executable.
$ chmod 755 countesdocs.sh
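As an aside, if you prefer a plain tabular number over the JSON that _count returns, Elasticsearch's _cat API exposes the same count:
$ curl -s 'http://localhost:9200/_cat/count/nasa?v'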
Running the measurement
Run Logstash and the measurement script at the same time. The measurement script runs on the host OS, and its output log is followed with tail.
bash-4.2$ logstash -f /usr/share/logstash/config/accesslog.conf
$ ./countesdocs.sh nasa > logstash_nasa.log &
$ tail -f logstash_nasa.log
Data started flowing in about a minute after Logstash was launched. Until the first document is indexed, the nasa index does not yet exist, which is why _count initially returns index_not_found_exception.
[2018/04/17 10:50:27] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
[2018/04/17 10:50:28] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
[2018/04/17 10:50:29] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
--- snip ---
[2018/04/17 10:51:29] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"nasa","index_uuid":"_na_","index":"nasa"},"status":404}
[2018/04/17 10:51:30] {"count":0,"_shards":{"total":5,"successful":4,"skipped":0,"failed":0}}
[2018/04/17 10:51:31] {"count":668,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
Going by the log above, ingestion ran from 10:50:27 to completion about 23 minutes later, at 11:12:52. I didn't tally it rigorously, but the rate works out to roughly 2,000 to 4,000 documents per second. I wasn't monitoring CPU or memory, but a Mac of this class apparently copes with data on this scale.
[2018/04/17 11:12:50] {"count":3454294,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:51] {"count":3458432,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:52] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:53] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
[2018/04/17 11:12:54] {"count":3461612,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
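As a rough cross-check of that per-second figure: from the first non-zero count at 10:51:31 to completion at 11:12:52 is 1,281 seconds, which averages out to about 2,700 documents per second:
$ echo $(( 3461612 / 1281 ))
2702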
With a decent server or a proper cluster setup, performance should be better still. Or so I'd like to think.
References
Loading Apache logs into Elasticsearch with Logstash
Logstash Configuration Examples
Install Elasticsearch with Docker
Search APIs
Publicly available access.log datasets