Notes on Elasticsearch analyzers
How an analyzer is built

- Character filters
- Tokenizer (required)
- Token filters
Analyzer processing flow
<html><body>Quick Brown Fox!</body></html>
↓ char_filter => html_strip
Quick Brown Fox!
↓ tokenizer => whitespace
Quick
Brown
Fox!
↓ filter => lowercase
quick
brown
fox!
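The flow above can be imitated with a few lines of plain Ruby. This is a toy sketch, not Elasticsearch's real implementation: the regex stands in for html_strip, `split` for the whitespace tokenizer, and `downcase` for the lowercase filter.

```ruby
# Toy pipeline: char_filter -> tokenizer -> token filter.
# Simplified stand-ins for html_strip / whitespace / lowercase.
def analyze(text)
  stripped = text.gsub(/<[^>]*>/, '')               # char_filter: html_strip
  tokens = stripped.split(/\s+/).reject(&:empty?)   # tokenizer: whitespace
  tokens.map(&:downcase)                            # filter: lowercase
end

p analyze('<html><body>Quick Brown Fox!</body></html>')
# => ["quick", "brown", "fox!"]
```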
Character filters

Perform any necessary preprocessing on the text (adding, removing, or changing characters) before it is tokenized.
'ngram_analyzer': {
  tokenizer: 'ngram_tokenizer',
  filter: ['kana_filter'],
  char_filter: [:icu_normalizer] # <- character filter
}
ICU normalizer

Normalizes Unicode text (provided by the analysis-icu plugin).

For example:
GET _analyze
{
  "text": "㈱Linkodeはソフトウェアの開発をしています。",
  "char_filter": ["icu_normalizer"]
}
# result
{
  "tokens" : [
    {
      "token" : "(株)linkodeはソフトウェアの開発をしています。",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "word",
      "position" : 0
    }
  ]
}
The company symbol ㈱ has been expanded to (株), the Latin letters have been normalized to lowercase, and half-width katakana has become full-width.
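Ruby's built-in `unicode_normalize` can illustrate the same kind of folding. Note this is plain NFKC; `icu_normalizer` defaults to nfkc_cf, which additionally case-folds, which is why the result above is lowercased. The input string here is made up for illustration.

```ruby
# NFKC expands ㈱ to (株), folds full-width Latin letters to ASCII,
# and widens half-width katakana. (Unlike icu_normalizer's default
# nfkc_cf mode, plain NFKC does not lowercase.)
s = '㈱Ｌｉｎｋｏｄｅ ｿﾌﾄ'
p s.unicode_normalize(:nfkc)
# => "(株)Linkode ソフト"
```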
Tokenizer

Splits a string into word-level tokens.
You specify how to split, for example with N-grams.
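The n-gram split itself can be sketched in a few lines. The min/max of 1 and 4 match the tokenizer configured below; the `token_chars` class filtering is omitted here for brevity.

```ruby
# Emit every substring of length min_gram..max_gram, ordered by start
# offset, mirroring how the ngram tokenizer walks the input.
def ngram_tokens(text, min_gram: 1, max_gram: 4)
  tokens = []
  (0...text.length).each do |start|
    (min_gram..max_gram).each do |len|
      break if start + len > text.length
      tokens << text[start, len]
    end
  end
  tokens
end

p ngram_tokens('ntt')
# => ["n", "nt", "ntt", "t", "tt", "t"]
```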
Token filters

Perform the desired operations (adding, removing, or changing tokens) on the tokens produced by the tokenizer.
ICU transform

Processes Unicode text in various ways, such as case mapping, normalization, transliteration, and bidirectional text handling. The transform to apply is selected by its ID.

Converting katakana to hiragana

Here is an example:
analyzer: {
  'ngram_analyzer': {
    tokenizer: 'ngram_tokenizer',
    filter: ['kana_filter'],
    char_filter: [:icu_normalizer] # Character filter
  },
}

# Token filter
'kana_filter': {
  type: :icu_transform,
  id: 'Katakana-Hiragana'
}

# Tokenizer
'ngram_tokenizer': {
  type: :ngram,
  min_gram: 1,
  max_gram: 4,
  # Character classes to include in tokens.
  # symbol is not listed, so symbols are rejected.
  # https://christina04.hatenablog.com/entry/2015/02/02/225734
  token_chars: %i(letter digit)
}
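For reference, here are the same analysis settings expressed as the JSON body Elasticsearch expects when creating an index. The index name my_index is a placeholder, and the analysis-icu plugin must be installed.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": ["kana_filter"],
          "char_filter": ["icu_normalizer"]
        }
      },
      "filter": {
        "kana_filter": {
          "type": "icu_transform",
          "id": "Katakana-Hiragana"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
```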
curl -H "Content-Type: application/json" -XGET 'localhost:9200/<index>/_analyze' -d '
{
  "analyzer": "ngram_analyzer",
  "text": "nttデータ"
}'
As the output below shows, for 「nttデータ」 the kana_filter converts 「データ」 into 「でえた」.
{
  "tokens": [
    {
      "token": "n",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "nt",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "ntt",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "nttで",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "t",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 4
    },
    {
      "token": "tt",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 5
    },
    {
      "token": "ttで",
      "start_offset": 1,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ttでえ",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "t",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 8
    },
    {
      "token": "tで",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 9
    },
    {
      "token": "tでえ",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 10
    },
    {
      "token": "tでえた",
      "start_offset": 2,
      "end_offset": 6,
      "type": "word",
      "position": 11
    },
    {
      "token": "で",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 12
    },
    {
      "token": "でえ",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 13
    },
    {
      "token": "でえた",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 14
    },
    {
      "token": "ー",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 15
    },
    {
      "token": "ーた",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 16
    },
    {
      "token": "た",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 17
    }
  ]
}
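The core of the katakana-to-hiragana mapping behind kana_filter is that the two kana blocks sit at a fixed Unicode offset from each other. Here is a naive sketch; ICU's Katakana-Hiragana transform additionally handles context-dependent cases such as the prolonged sound mark ー (which is why でー became でえ above), which this simple version does not.

```ruby
# Katakana (U+30A1..U+30F3) and hiragana (U+3041..U+3093) are parallel
# blocks, so a character-range translation covers the basic mapping.
# The prolonged sound mark ー (U+30FC) is outside the range and is kept.
def kata_to_hira(text)
  text.tr('ァ-ン', 'ぁ-ん')
end

p kata_to_hira('カタカナ')
# => "かたかな"
```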
Reference article