Notes on Elasticsearch analyzers
How an analyzer is built

- Character filters
- Tokenizer (required)
- Token filters
Analyzer processing flow
<html><body>Quick Brown Fox!</body></html>
↓ char_filter => html_strip
Quick Brown Fox!
↓ tokenizer => whitespace
Quick
Brown
Fox!
↓ filter => lowercase
quick
brown
fox!
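The flow above can be imitated with a few lines of plain Ruby. This is a toy sketch, not Elasticsearch's real implementation: the regex stands in for html_strip, `split` for the whitespace tokenizer, and `downcase` for the lowercase filter.

```ruby
# Toy pipeline: char_filter -> tokenizer -> token filter.
# Simplified stand-ins for html_strip / whitespace / lowercase.
def analyze(text)
  stripped = text.gsub(/<[^>]*>/, '')               # char_filter: html_strip
  tokens = stripped.split(/\s+/).reject(&:empty?)   # tokenizer: whitespace
  tokens.map(&:downcase)                            # filter: lowercase
end

p analyze('<html><body>Quick Brown Fox!</body></html>')
# => ["quick", "brown", "fox!"]
```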
Character filters

Perform any necessary preprocessing on the text (adding, removing, or changing characters) before it is tokenized.
'ngram_analyzer': {
  tokenizer: 'ngram_tokenizer',
  filter: ['kana_filter'],
  char_filter: [:icu_normalizer] # <- character filter
}
ICU normalizer

Normalizes Unicode text (provided by the analysis-icu plugin).

For example:
GET _analyze
{
  "text": "㈱Linkodeはソフトウェアの開発をしています。",
  "char_filter": ["icu_normalizer"]
}
# result
{
  "tokens" : [
    {
      "token" : "(株)linkodeはソフトウェアの開発をしています。",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "word",
      "position" : 0
    }
  ]
}
The company symbol ㈱ has been expanded to (株), the Latin letters have been normalized to lowercase, and half-width katakana has become full-width.
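Ruby's built-in `unicode_normalize` can illustrate the same kind of folding. Note this is plain NFKC; `icu_normalizer` defaults to nfkc_cf, which additionally case-folds, which is why the result above is lowercased. The input string here is made up for illustration.

```ruby
# NFKC expands ㈱ to (株), folds full-width Latin letters to ASCII,
# and widens half-width katakana. (Unlike icu_normalizer's default
# nfkc_cf mode, plain NFKC does not lowercase.)
s = '㈱Ｌｉｎｋｏｄｅ ｿﾌﾄ'
p s.unicode_normalize(:nfkc)
# => "(株)Linkode ソフト"
```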
Tokenizer

Splits a string into word-level tokens.
You specify how to split, for example with N-grams.
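The n-gram split itself can be sketched in a few lines. The min/max of 1 and 4 match the tokenizer configured below; the `token_chars` class filtering is omitted here for brevity.

```ruby
# Emit every substring of length min_gram..max_gram, ordered by start
# offset, mirroring how the ngram tokenizer walks the input.
def ngram_tokens(text, min_gram: 1, max_gram: 4)
  tokens = []
  (0...text.length).each do |start|
    (min_gram..max_gram).each do |len|
      break if start + len > text.length
      tokens << text[start, len]
    end
  end
  tokens
end

p ngram_tokens('ntt')
# => ["n", "nt", "ntt", "t", "tt", "t"]
```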
Token filters

Perform the desired operations (adding, removing, or changing tokens) on the tokens produced by the tokenizer.
ICU transform

Processes Unicode text in various ways, such as case mapping, normalization, transliteration, and bidirectional text handling. The transform to apply is selected by its ID.

Converting katakana to hiragana

Here is an example:
analyzer: {
  'ngram_analyzer': {
    tokenizer: 'ngram_tokenizer',
    filter: ['kana_filter'],
    char_filter: [:icu_normalizer] # Character filter
  },
}

# Token filter
'kana_filter': {
  type: :icu_transform,
  id: 'Katakana-Hiragana'
}

# Tokenizer
'ngram_tokenizer': {
  type: :ngram,
  min_gram: 1,
  max_gram: 4,
  # Character classes to include in tokens.
  # symbol is not listed, so symbols are rejected.
  # https://christina04.hatenablog.com/entry/2015/02/02/225734
  token_chars: %i(letter digit)
}
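For reference, here are the same analysis settings expressed as the JSON body Elasticsearch expects when creating an index. The index name my_index is a placeholder, and the analysis-icu plugin must be installed.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": ["kana_filter"],
          "char_filter": ["icu_normalizer"]
        }
      },
      "filter": {
        "kana_filter": {
          "type": "icu_transform",
          "id": "Katakana-Hiragana"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
```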
curl -H "Content-Type: application/json" -XGET 'localhost:9200/<index>/_analyze' -d '
{
  "analyzer": "ngram_analyzer",
  "text": "nttデータ"
}'
As the output below shows, for 「nttデータ」 the kana_filter converts 「データ」 into 「でえた」.
{
  "tokens": [
    {
      "token": "n",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "nt",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "ntt",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "nttで",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "t",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 4
    },
    {
      "token": "tt",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 5
    },
    {
      "token": "ttで",
      "start_offset": 1,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ttでえ",
      "start_offset": 1,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "t",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 8
    },
    {
      "token": "tで",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 9
    },
    {
      "token": "tでえ",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 10
    },
    {
      "token": "tでえた",
      "start_offset": 2,
      "end_offset": 6,
      "type": "word",
      "position": 11
    },
    {
      "token": "で",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 12
    },
    {
      "token": "でえ",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 13
    },
    {
      "token": "でえた",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 14
    },
    {
      "token": "ー",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 15
    },
    {
      "token": "ーた",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 16
    },
    {
      "token": "た",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 17
    }
  ]
}
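The core of the katakana-to-hiragana mapping behind kana_filter is that the two kana blocks sit at a fixed Unicode offset from each other. Here is a naive sketch; ICU's Katakana-Hiragana transform additionally handles context-dependent cases such as the prolonged sound mark ー (which is why でー became でえ above), which this simple version does not.

```ruby
# Katakana (U+30A1..U+30F3) and hiragana (U+3041..U+3093) are parallel
# blocks, so a character-range translation covers the basic mapping.
# The prolonged sound mark ー (U+30FC) is outside the range and is kept.
def kata_to_hira(text)
  text.tr('ァ-ン', 'ぁ-ん')
end

p kata_to_hira('カタカナ')
# => "かたかな"
```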
Reference article