Importing Wikipedia data into Elasticsearch with Docker

Using an older Elasticsearch version (5.6.9), we import Wikipedia data through Logstash and build an environment where it can be browsed in Kibana.

Preparing the data

The Wikipedia dump data can be downloaded here. After checking the contents, download the file named jawiki-yyyyMMdd-pages-articles-multistream.xml.bz2 from the Japanese Wikipedia dump.
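
For reference, a minimal download sketch with wget; the exact URL and the yyyyMMdd date below are assumptions, so substitute an actual dump date from the Wikimedia dump site:

# Hypothetical example: replace yyyyMMdd with a real dump date
wget https://dumps.wikimedia.org/jawiki/yyyyMMdd/jawiki-yyyyMMdd-pages-articles-multistream.xml.bz2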

Setting up the environment

The directory structure is as follows.

.
├── docker-compose.yml
├── elasticsearch
│   └── Dockerfile
├── kibana
│   └── Dockerfile
└── logstash
    ├── Dockerfile
    ├── input
    │   └── test.xml
    └── pipeline
        └── wikipedia.conf

Elasticsearch configuration

FROM docker.elastic.co/elasticsearch/elasticsearch:5.6.9

RUN elasticsearch-plugin remove x-pack

RUN elasticsearch-plugin install analysis-kuromoji

RUN elasticsearch-plugin install analysis-icu

To support Japanese-language search, we install the analysis-kuromoji and analysis-icu plugins. Since we have no license, x-pack needs to be removed.
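
Once the container is up, one quick way to confirm that the kuromoji analyzer is available is an _analyze request such as the following (a sketch; the sample text is arbitrary and it assumes port 9200 is published as in the docker-compose.yml shown later):

# Ask Elasticsearch to tokenize a short Japanese sentence with the kuromoji analyzer
curl -X POST "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{"analyzer": "kuromoji", "text": "日本語の形態素解析"}'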

Kibana configuration

FROM docker.elastic.co/kibana/kibana:5.6.9

RUN kibana-plugin remove x-pack

Other than removing the x-pack plugin, no special configuration is needed.

Logstash configuration

First, rename the downloaded Wikipedia dump to test.xml and place it in the ./logstash/input folder.
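
Continuing from the download sketch above, decompressing and placing the file could look like this (file names follow the steps described here):

# Decompress the dump and move it to the folder mounted into the Logstash container
bunzip2 jawiki-yyyyMMdd-pages-articles-multistream.xml.bz2
mv jawiki-yyyyMMdd-pages-articles-multistream.xml ./logstash/input/test.xml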

Dockerfile contents

FROM docker.elastic.co/logstash/logstash:5.6.9

RUN mkdir -p /usr/share/logstash/input

RUN logstash-plugin remove x-pack

RUN sed -i '/xpack/d' /usr/share/logstash/config/logstash.yml

RUN logstash-plugin install logstash-input-file

RUN logstash-plugin install logstash-filter-xml

RUN logstash-plugin install logstash-filter-mutate

RUN logstash-plugin install logstash-filter-grok

RUN logstash-plugin install logstash-filter-date

RUN logstash-plugin install logstash-output-elasticsearch

RUN logstash-plugin install logstash-codec-multiline

Since the data is imported from XML, we install a number of plugins.
One more thing to note: in this version, running `sed -i '/xpack/d' /usr/share/logstash/config/logstash.yml` is required.
Even after uninstalling x-pack with `logstash-plugin remove x-pack`, the x-pack settings remain in logstash.yml, and Logstash fails with an error unless they are deleted.
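
As a sanity check (a sketch, assuming the images are built with the docker-compose.yml shown later), the installed plugins can be listed by overriding the container entrypoint:

# Hypothetical check: list the Logstash plugins baked into the image
docker-compose run --rm --entrypoint logstash-plugin logstash list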

Pipeline configuration

input {
  file {
    path => "/usr/share/logstash/input/test.xml"
    start_position => "beginning"
    codec => multiline {
        pattern => "<page"
        negate => true
        what => "previous"
        auto_flush_interval => 1
    }
  }
}
filter {
    xml {
        source => "message"
        target => "doc"
        id => "id"
        store_xml => false
        periodic_flush => true
        xpath => [ "(page/title/text())[1]", "title" ]
        xpath => [ "(page/id/text())[1]", "id" ]
        xpath => [ "page/revision/text", "text" ]
    }
    mutate {
        remove_field => ["doc", "path", "host", "message", "tags"]
        join => ["id", ""]
        join => ["title", ""]
        gsub => [
            "text", "https?[^\s]+|<text xml:space=\"preserve\">|</text>", " ",
            "text", "==See also==(.|\n)+|==References==(.|\n)+|==Further reading==(.|\n)+", " ",
            "text", "(\&lt;.+?\&gt;)", " ",
            "text", "(\/ref|\{\{[c|C]ite.+?\}\})", " ",
            "text", "[\[\[|\]\]|==|=|\(|\)|\{\{|\}\}|]|\#+|'+|\&amp;|\&lt;|\&gt;|&nbsp;", " ",
            "text", "\.", " . ",
            "text", "\,", " , ",
            "text", "\:", " : ",
            "text", "\;", " ; ",
            "text", "\/", " \/ ",
            "text", '"', ' " ',
            "text", " +", " ",
            "text", "\. (\. )+", ". ",
            "text", '\n *(\n| )*', ' <br> '
        ]
    }
}
output {
  elasticsearch {
    hosts => "elasticsearch"
    index => "wiki-"
    document_id => "%{id}"
  }
}

Each document in the index corresponds to one wiki page. Register wiki-* as an index pattern in Kibana and you can browse the data on screen.
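
To confirm that documents are actually arriving, a simple count query against the published port also works (a sketch; it assumes the stack below is running on the same host):

# Count the documents indexed so far; the index name "wiki-" comes from the pipeline above
curl "http://localhost:9200/wiki-/_count?pretty"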

docker-compose.yml configuration

version: "3.3"

services:
  elasticsearch:
    build: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./elasticsearch/data:/usr/share/elasticsearch/data
    networks: [elastic]

  kibana:
    build: kibana
    ports:
      - "5601:5601"
    networks: [elastic]
    links:
      - elasticsearch:elasticsearch

  logstash:
    build: logstash
    volumes:
      - ./logstash/pipeline/:/usr/share/logstash/pipeline/
      - ./logstash/input/:/usr/share/logstash/input/
    networks: [elastic]
    links:
      - elasticsearch:elasticsearch
    depends_on:
      - elasticsearch

networks:
  elastic:

After that, build and run the stack. It may take a while, but the Wikipedia data will be loaded from Logstash into Elasticsearch.
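
Concretely, something along these lines, run from the directory containing docker-compose.yml:

# Build the images and start the whole stack in the background
docker-compose build
docker-compose up -d
# Follow the Logstash logs to watch the import progress
docker-compose logs -f logstash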
