Trying kuromoji + Neologd in an Elasticsearch + Kibana environment with Docker

3 years ago

清, 扬

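Goal

In a previous post I built an Elasticsearch + Kibana environment with Docker. kuromoji is installed there, but its dictionary has to be maintained by hand, so this time I use mecab-ipadic-neologd instead. Fortunately, an Elasticsearch plugin for it is available, so I use that.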

Updating the Docker container

Environment

elasticsearch-analysis-kuromoji-neologd 5.1

Updating the Dockerfile

Add elasticsearch-plugin install org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.1.0 to the Dockerfile from the previous post. The resulting Dockerfile is shown below. Rebuild the container from it.

FROM elasticsearch:5.1

# Install X-Pack
RUN elasticsearch-plugin install --batch x-pack

# Install kuromoji
RUN elasticsearch-plugin install analysis-kuromoji

# Install Elasticsearch Analysis Kuromoji Neologd
RUN elasticsearch-plugin install org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.1.0

docker-compose.yml

Nothing has changed in this file; it is reproduced here for reference.

version: '2'
services:
  elasticsearch0:
    build: es
    volumes:
        - es-data0:/usr/share/elasticsearch/data 
        - ./es/config:/usr/share/elasticsearch/config 
    ports:
        - 9200:9200
    expose:
        - 9300
    environment:
        - NODE_NAME=node0
    hostname: elasticsearch0
    ulimits:
        nofile:
            soft: 65536
            hard: 65536
  kibana:
    build: kibana
    links:
        - elasticsearch0:elasticsearch 
    ports:
        - 5601:5601

volumes:
    es-data0:
        driver: local
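
The rebuild step can be sketched as two small shell helpers. This is only a sketch, assuming docker-compose is run from the directory that contains docker-compose.yml and that the Dockerfile lives in ./es (the build context named above). The function names rebuild_stack and list_plugins are hypothetical.

```shell
# Rebuild the Elasticsearch image (build context ./es) and restart the stack.
rebuild_stack() {
  docker-compose build elasticsearch0 &&
    docker-compose up -d
}

# List the plugins installed inside the running Elasticsearch container.
# elasticsearch-plugin is the same command the Dockerfile uses to install them.
list_plugins() {
  docker-compose exec elasticsearch0 elasticsearch-plugin list
}
```

Running rebuild_stack && list_plugins should list the three plugins installed by the Dockerfile. If one is missing, the image was probably not rebuilt.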

Trying analysis with Kuromoji Neologd

Template

curl -XPUT 'http://localhost:9200/_template/items_template?pretty' -d '
{
    "template": "items",
    "settings": {
        "index":{
            "analysis":{
                "tokenizer": {
                    "my_tokenizer": {
                        "type": "kuromoji_neologd_tokenizer",
                        "mode": "normal",
                        "discard_punctuation" : "false",
                        "user_dictionary" : "userdict_ja.txt"
                    }
                },
                "filter": {
                    "synonym_dict": {
                        "type": "synonym",
                        "synonyms_path" : "synonym.txt"
                    }
                },
                "analyzer" : {
                    "default" : {
                        "type": "custom",
                        "tokenizer": "my_tokenizer",
                        "filter": ["synonym_dict"]
                    }
                }
            }
        }
    },
    "mappings" : {
        "items":{
            "properties" : {
                "id" :{ "type" : "keyword" },
                "text" :{ "type" : "keyword" }
            }
        }
    }
}'

settings -> index -> analysis -> tokenizer

Setting values

                    "my_tokenizer": {
                        "type": "kuromoji_neologd_tokenizer",
                        "mode": "normal",
                        "discard_punctuation" : "false",
                        "user_dictionary" : "userdict_ja.txt"
                    }

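Meaning

Setting | Description
tokenizer name | Set to my_tokenizer. The analyzer defined later refers to it.
type | Uses the tokenizer provided by elasticsearch-analysis-kuromoji-neologd.
mode | The morphological analysis mode. normal is specified so that the mecab-ipadic-neologd results are easy to check. The default is search.
discard_punctuation | Set to false so that punctuation and symbols are kept. Normally you would specify true; false is used here deliberately, to inspect the raw analysis results.
user_dictionary | The file name of the user dictionary. Place the file under $ES_HOME (/usr/share/elasticsearch)/config. The docker-compose.yml above maps ./es/config to $ES_HOME/config through volumes, so create an empty file with touch ./es/config/userdict_ja.txt. Elasticsearch fails to start if the file does not exist.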
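
Both files referenced by the template (userdict_ja.txt here and synonym.txt in the filter section) can be created in one step before starting the containers. A minimal sketch, assuming it is run from the directory that contains docker-compose.yml; the CONFIG_DIR variable is only introduced for illustration.

```shell
# Host directory that docker-compose.yml mounts at /usr/share/elasticsearch/config.
CONFIG_DIR=./es/config

mkdir -p "$CONFIG_DIR"
# touch creates missing files and leaves the content of existing ones untouched.
touch "$CONFIG_DIR/userdict_ja.txt" "$CONFIG_DIR/synonym.txt"
ls -l "$CONFIG_DIR/userdict_ja.txt" "$CONFIG_DIR/synonym.txt"
```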

settings -> index -> analysis -> filter

Setting values

                    "synonym_dict": {
                        "type": "synonym",
                        "synonyms_path" : "synonym.txt"
                    }

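Meaning

Setting | Description
filter name | Set to synonym_dict. The analyzer defined later refers to it.
type | Set to synonym, the filter type that handles synonyms.
synonyms_path | The name of the file that defines the synonyms. As with user_dictionary for my_tokenizer, create an empty file beforehand.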
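
The article keeps synonym.txt empty, but for reference the synonym filter accepts Solr-style rules. The sketch below writes two illustrative rules to a separate sample file, so the results shown in this article are not affected. The file name synonym.sample.txt and the entries themselves are hypothetical.

```shell
# Written next to synonym.txt; copy rules into synonym.txt only when you want them applied.
SAMPLE_FILE=./es/config/synonym.sample.txt

mkdir -p ./es/config
cat > "$SAMPLE_FILE" <<'EOF'
# Comma-separated terms are treated as equivalent to each other.
すし, 寿司
# Terms left of => are replaced by the term on the right.
スシ => 寿司
EOF

cat "$SAMPLE_FILE"
```

Lines starting with # are comments in this format. A rule only matches if its terms line up with the tokens produced by my_tokenizer.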

settings -> index -> analysis -> analyzer

Setting values

                "analyzer" : {
                    "default" : {
                        "type": "custom",
                        "tokenizer": "my_tokenizer",
                        "filter": ["synonym_dict"]
                    }
                }

Meaning

Setting | Description
analyzer name | Naming it default makes this the default analyzer.
type | custom declares that the analyzer is a customized one.
tokenizer | Uses the tokenizer defined above.
filter | Lists the filters to apply.

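About the filters

Real-world use would need additional filters, but they are kept to a minimum here so that the kuromoji + Neologd morphological analysis results are easy to inspect.

Setting | Description
synonym_dict | A filter for handling synonyms; it uses the synonym_dict defined above. https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-synonym-tokenfilter.html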

Trying the analyzer

Loading data

Insert just one document so that the template defined above is applied.

curl -XPUT 'localhost:9200/items/item/1?pretty' -d '
{
  "id" : "item-001",
  "text": "MeCab はオープンソースの形態素解析エンジンであり、自然言語処理の基礎となる形態素解析のデファクトとなるツールです。また各言語用バインディングを使うことで Ruby や Python をはじめ多くのさまざまなプログラミング言語から呼び出して利用することもでき大変便利です。"
}'

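Trying the mecab-ipadic-neologd example

I run the example sentence from https://github.com/neologd/mecab-ipadic-neologd through the analyzer. It describes the broadcast on the 10th of 「中居正広のミになる図書館」 (TV Asahi), in which SMAP's 中居正広 reveals a past misunderstanding of 篠原信一's.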

curl 'localhost:9200/items/_analyze?pretty' --data-binary '{
"explain":"false",
"text":"10日放送の「中居正広のミになる図書館」（テレビ朝日系）で、SMAPの中居正広が、篠原信一の過去の勘違いを 明かす一幕があった。"
}'

{
  "tokens" : [
    {
      "token" : "10日",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "放送",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "の",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "「",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "中居正広のミになる図書館",
      "start_offset" : 7,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "」",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "（",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "テレビ朝日",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "系",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "）",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "で",
      "start_offset" : 28,
      "end_offset" : 29,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "、",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "SMAP",
      "start_offset" : 30,
      "end_offset" : 34,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "の",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "中居正広",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "が",
      "start_offset" : 39,
      "end_offset" : 40,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "、",
      "start_offset" : 40,
      "end_offset" : 41,
      "type" : "word",
      "position" : 16
    },
    {
      "token" : "篠原信一",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "word",
      "position" : 17
    },
    {
      "token" : "の",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "word",
      "position" : 18
    },
    {
      "token" : "過去",
      "start_offset" : 46,
      "end_offset" : 48,
      "type" : "word",
      "position" : 19
    },
    {
      "token" : "の",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "word",
      "position" : 20
    },
    {
      "token" : "勘違い",
      "start_offset" : 49,
      "end_offset" : 52,
      "type" : "word",
      "position" : 21
    },
    {
      "token" : "を",
      "start_offset" : 52,
      "end_offset" : 53,
      "type" : "word",
      "position" : 22
    },
    {
      "token" : " ",
      "start_offset" : 53,
      "end_offset" : 54,
      "type" : "word",
      "position" : 23
    },
    {
      "token" : "明かす",
      "start_offset" : 54,
      "end_offset" : 57,
      "type" : "word",
      "position" : 24
    },
    {
      "token" : "一幕",
      "start_offset" : 57,
      "end_offset" : 59,
      "type" : "word",
      "position" : 25
    },
    {
      "token" : "が",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "word",
      "position" : 26
    },
    {
      "token" : "あっ",
      "start_offset" : 60,
      "end_offset" : 62,
      "type" : "word",
      "position" : 27
    },
    {
      "token" : "た",
      "start_offset" : 62,
      "end_offset" : 63,
      "type" : "word",
      "position" : 28
    },
    {
      "token" : "。",
      "start_offset" : 63,
      "end_offset" : 64,
      "type" : "word",
      "position" : 29
    }
  ]
}

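The result is the same as in the mecab-ipadic-neologd example. To get detailed information such as part of speech and reading, set explain to true.

References

https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis-kuromoji.html and the pages under Japanese (kuromoji) Analysis Plugin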
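
The full response is long, so when only the segmentation matters it can help to print just the token values. A minimal sketch, assuming the pretty-printed layout shown above (one "token" : "..." line per token); the helper name extract_tokens is hypothetical.

```shell
# Print one token per line from a pretty-printed _analyze response read on stdin.
# Relies on the '"token" : "...",' line layout shown in the output above;
# JSON escapes inside a token (for example \") are printed as-is.
extract_tokens() {
  sed -n 's/^ *"token" : "\(.*\)",$/\1/p'
}

# Usage sketch (needs the running stack from this article):
#   curl -s 'localhost:9200/items/_analyze?pretty' --data-binary '{"text":"..."}' | extract_tokens
```

This only does text matching on the pretty-printed form, so keep ?pretty in the URL. A JSON-aware tool would be more robust if one is available.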
