Importing Wikipedia data into Elasticsearch with Docker
Using an older Elasticsearch version (5.6.9), we'll import Wikipedia data through Logstash and set up an environment for browsing it in Kibana.
Preparing the data
You can download the Wikipedia data from here. Once you have confirmed the contents are what you need, download the file named jawiki-yyyyMMdd-pages-articles-multistream.xml.bz2 from the Japanese Wikipedia dump.
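As a concrete sketch of the download step (the dump date below is hypothetical; substitute a date actually listed on the dump site):

```shell
# Hypothetical dump date -- replace with a real date listed at
# https://dumps.wikimedia.org/jawiki/
DUMP_DATE=20240101
FILE="jawiki-${DUMP_DATE}-pages-articles-multistream.xml.bz2"

# Uncomment to actually download (the file is several GB):
# wget "https://dumps.wikimedia.org/jawiki/${DUMP_DATE}/${FILE}"
echo "$FILE"
```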
Setting up the environment
The directory structure looks like this.
.
├── docker-compose.yml
├── elasticsearch
│   └── Dockerfile
├── kibana
│   └── Dockerfile
└── logstash
    ├── Dockerfile
    ├── input
    │   └── test.xml
    └── pipeline
        └── wikipedia.conf
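The skeleton above can be created in one go (the paths come straight from the tree; the top-level directory name `wiki-es` is arbitrary):

```shell
# Create the directory layout shown in the tree above
mkdir -p wiki-es/elasticsearch wiki-es/kibana \
         wiki-es/logstash/input wiki-es/logstash/pipeline
touch wiki-es/docker-compose.yml \
      wiki-es/elasticsearch/Dockerfile \
      wiki-es/kibana/Dockerfile \
      wiki-es/logstash/Dockerfile \
      wiki-es/logstash/input/test.xml \
      wiki-es/logstash/pipeline/wikipedia.conf
```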
Elasticsearch setup
FROM docker.elastic.co/elasticsearch/elasticsearch:5.6.9
RUN elasticsearch-plugin remove x-pack
RUN elasticsearch-plugin install analysis-kuromoji
RUN elasticsearch-plugin install analysis-icu
To support Japanese-language search, we install the analysis-kuromoji and analysis-icu plugins. Since we don't have a license, x-pack is removed.
Kibana setup
FROM docker.elastic.co/kibana/kibana:5.6.9
RUN kibana-plugin remove x-pack
The only change is removing the x-pack plugin; no other special configuration is needed.
Logstash setup
First, rename the downloaded Wikipedia dump to test.xml and place it in the ./logstash/input directory.
The Dockerfile
FROM docker.elastic.co/logstash/logstash:5.6.9
RUN mkdir -p /usr/share/logstash/input
RUN logstash-plugin remove x-pack
RUN sed -i '/xpack/d' /usr/share/logstash/config/logstash.yml
RUN logstash-plugin install logstash-input-file
RUN logstash-plugin install logstash-filter-xml
RUN logstash-plugin install logstash-filter-mutate
RUN logstash-plugin install logstash-filter-grok
RUN logstash-plugin install logstash-filter-date
RUN logstash-plugin install logstash-output-elasticsearch
RUN logstash-plugin install logstash-codec-multiline
Since the data is imported from XML, we install the various plugins needed for that.
One other thing to watch out for: in this version, running `sed -i '/xpack/d' /usr/share/logstash/config/logstash.yml` is required. Even after uninstalling x-pack with `logstash-plugin remove x-pack`, its settings remain in logstash.yml, and Logstash errors out at startup if they are not removed.
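The effect of that sed line can be seen on a stand-alone sample (the file contents below are made up for the demo; the real logstash.yml differs):

```shell
# Sample config with leftover x-pack settings (contents are illustrative only)
cat > /tmp/logstash-demo.yml <<'EOF'
http.host: "0.0.0.0"
path.config: /usr/share/logstash/pipeline
xpack.monitoring.enabled: false
xpack.monitoring.elasticsearch.url: http://elasticsearch:9200
EOF

# Delete every line mentioning xpack, exactly as in the Dockerfile
sed -i '/xpack/d' /tmp/logstash-demo.yml
cat /tmp/logstash-demo.yml
```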
Pipeline setup
input {
  file {
    path => "/usr/share/logstash/input/test.xml"
    start_position => "beginning"
    codec => multiline {
      pattern => "<page"
      negate => true
      what => "previous"
      auto_flush_interval => 1
    }
  }
}
filter {
  xml {
    source => "message"
    target => "doc"
    id => "id"
    store_xml => false
    periodic_flush => true
    xpath => [ "(page/title/text())[1]", "title" ]
    xpath => [ "(page/id/text())[1]", "id" ]
    xpath => [ "page/revision/text", "text" ]
  }
  mutate {
    remove_field => ["doc", "path", "host", "message", "tags"]
    join => ["id", ""]
    join => ["title", ""]
    gsub => [
      "text", "https?[^\s]+|<text xml:space=\"preserve\">|</text>", " ",
      "text", "==See also==(.|\n)+|==References==(.|\n)+|==Further reading==(.|\n)+", " ",
      "text", "(\<.+?\>)", " ",
      "text", "(\/ref|\{\{[c|C]ite.+?\}\})", " ",
      "text", "[\[\[|\]\]|==|=|\(|\)|\{\{|\}\}|]|\#+|'+|\&|\<|\>| ", " ",
      "text", "\.", " . ",
      "text", "\,", " , ",
      "text", "\:", " : ",
      "text", "\;", " ; ",
      "text", "\/", " \/ ",
      "text", '"', ' " ',
      "text", " +", " ",
      "text", "\. (\. )+", ". ",
      "text", '\n *(\n| )*', ' <br> '
    ]
  }
}
output {
  elasticsearch {
    hosts => "elasticsearch"
    index => "wiki-"
    document_id => "%{id}"
  }
}
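To get a feel for the first gsub rule (stripping URLs and the <text> wrapper), here is the same pattern applied with sed to a one-line sample; note that sed's ERE has no `\s`, so `[[:space:]]` stands in for it. This only illustrates the regex — Logstash applies it per event:

```shell
# Apply the pipeline's first gsub pattern to a sample line
echo '<text xml:space="preserve">See https://example.com for details</text>' |
  sed -E 's#https?[^[:space:]]+|<text xml:space="preserve">|</text># #g'
```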
Each document in the index is one Wikipedia page. Register wiki-* as an index pattern in Kibana and you can browse the data on screen.
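The multiline codec in the input section is what turns the raw dump into one event per <page>: any line not starting a new `<page>` is appended to the previous event. A minimal shell analogue of that grouping, on a toy two-page file:

```shell
# Toy input: two <page> blocks spread across several lines
cat > /tmp/pages.xml <<'EOF'
<page>
  <title>Foo</title>
  <id>1</id>
</page>
<page>
  <title>Bar</title>
  <id>2</id>
</page>
EOF

# Start a new "event" at each line matching "<page"; glue all other
# lines onto the previous one -- same idea as the multiline codec
awk '/<page/ { if (n++) print "" } { printf "%s ", $0 } END { print "" }' /tmp/pages.xml
```

This prints two lines, one per page, which is exactly the unit the xml filter then parses.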
The docker-compose.yml
version: "3.3"
services:
  elasticsearch:
    build: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./elasticsearch/data:/usr/share/elasticsearch/data
    networks: [elastic]
  kibana:
    build: kibana
    ports:
      - "5601:5601"
    networks: [elastic]
    links:
      - elasticsearch:elasticsearch
  logstash:
    build: logstash
    volumes:
      - ./logstash/pipeline/:/usr/share/logstash/pipeline/
      - ./logstash/input/:/usr/share/logstash/input/
    networks: [elastic]
    links:
      - elasticsearch:elasticsearch
    depends_on:
      - elasticsearch
networks:
  elastic:
After that, run `docker-compose up --build`. The build and the import take a while, but the Wikipedia data will flow from Logstash into Elasticsearch.