How Using Amazon Elasticsearch Changed Our Search Mechanism

2 years ago

逸, 科


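At work I had to introduce a search engine built on Amazon Elasticsearch, so I decided to write down what I did.
Incidentally, the code samples shown here are the code used in an actual sample project.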

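Addendum, September 13, 2016
Before I took over, the design used a different search engine rather than Elasticsearch. Because the permission data is hierarchical, performance dropped significantly as the data volume grew. To fix this, we redesigned the data structure and reimplemented the search on Elasticsearch, the most widely used option.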

What I did

Backup and restore

Environment

elasticsearch-model

Details of what I did

The sections below describe each item in detail.

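Creating the index / creating the alias / data import batch

While going through material from various study groups, I learned that, with index updates and operations in mind, searching through an alias instead of naming the index directly makes it possible to carry out various operations with zero downtime. For the initial build I therefore decided to create both the index and the alias in the batch. Once the alias has been created, it is displayed in the head plugin.

The sample code is as follows.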

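require 'optparse'

class SetupElasticsearch
  class << self
    def execute
      # `name` is the class name; the log file rotates daily.
      logger = ActiveSupport::Logger.new("log/#{name.underscore}_batch.log", 'daily')
      force = args[:force] || false

      logger.info("Creating index and alias (force: #{force})")
      Photo.create_index!(force: force)
      Photo.create_alias!
      # Import the data here
    end

    private

    # Parse command-line options into a hash.
    def args
      options = {}

      OptionParser.new { |o|
        o.banner = "Usage: #{$0} [options]"
        # The value arrives as a string, so compare it explicitly;
        # otherwise "--force=false" would be truthy.
        o.on("--force=OPT", "pass true to delete and recreate the index") { |v| options[:force] = (v == 'true') }
      }.parse!(ARGV.dup)

      options
    end
  end
end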

module Searchable
  extend ActiveSupport::Concern

  included do
    include Elasticsearch::Model
    include Elasticsearch::Model::Callbacks

    unless Rails.env.test?
      after_save :transfer_to_elasticsearch
      after_destroy :remove_from_elasticsearch
    end

    # Set up index configuration and mapping
    settings index: {
      number_of_shards:   5,
      number_of_replicas: 1,
      analysis: {
        filter: {
          pos_filter: {
            type:     'kuromoji_part_of_speech',
            stoptags: ['助詞-格助詞-一般', '助詞-終助詞']
          },
          greek_lowercase_filter: {
            type:     'lowercase',
            language: 'greek'
          },
          kuromoji_ks: {
            type: 'kuromoji_stemmer',
            minimum_length: '5'
          }
        },
        tokenizer: {
          kuromoji: {
            type: 'kuromoji_tokenizer'
          },
          ngram_tokenizer: {
            type: 'nGram',
            min_gram: '2',
            max_gram: '3',
            token_chars: %w(letter digit)
          }
        },
        analyzer: {
          kuromoji_analyzer: {
            type:      'custom',
            tokenizer: 'kuromoji_tokenizer',
            filter:    %w(kuromoji_baseform pos_filter greek_lowercase_filter cjk_width)
          },
          ngram_analyzer: {
            tokenizer: "ngram_tokenizer"
          }
        }
      }
    } do
      mapping _source: { enabled: true },
              _all: { enabled: true, analyzer: "kuromoji_analyzer" } do
        indexes :id, type: 'integer', index: 'not_analyzed'
        indexes :description, type: 'string', analyzer: 'kuromoji_analyzer'
      end
    end

    def as_indexed_json(_options = {})
      as_json
    end

    # Index (or reindex) this record after it is saved.
    def transfer_to_elasticsearch
      __elasticsearch__.client.index index: self.class.index_name, type: 'photo', id: id, body: as_indexed_json
    end

    # Remove this record from the index after it is destroyed.
    def remove_from_elasticsearch
      __elasticsearch__.client.delete index: self.class.index_name, type: 'photo', id: id
    end
  end

  module ClassMethods
    # Create the index; with force: true the existing index is deleted first.
    def create_index!(options = {})
      client = __elasticsearch__.client
      client.indices.delete index: Consts::Elasticsearch[:index_name][:photo] if options[:force]
      client.indices.create index: Consts::Elasticsearch[:index_name][:photo],
        body: {
          settings: settings.to_hash,
          mappings: mappings.to_hash
        }
    end

    # Point the alias at the index, replacing an existing alias of the same name.
    def create_alias!
      client = __elasticsearch__.client
      if client.indices.exists_alias? name: Consts::Elasticsearch[:alias_name][:photo]
        client.indices.delete_alias index: Consts::Elasticsearch[:index_name][:photo], name: Consts::Elasticsearch[:alias_name][:photo]
      end

      client.indices.put_alias index: Consts::Elasticsearch[:index_name][:photo], name: Consts::Elasticsearch[:alias_name][:photo]
    end

    # Send all records to Elasticsearch in batches through the bulk API.
    def bulk_import
      client = __elasticsearch__.client

      find_in_batches.with_index do |entries, i|
        client.bulk(
          index: index_name,
          type: document_type,
          body: entries.map { |entry| { index: { _id: entry.id, data: entry.as_indexed_json } } },
          refresh: (i > 0 && i % 3 == 0) # NOTE: Elasticsearch gets sluggish unless we refresh periodically
        )
      end
    end
  end
end

create_index! creates the index, and create_alias! attaches an alias to it. bulk_import then loads the data into Elasticsearch through the bulk API.
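The zero-downtime benefit of the alias shows up when the index later has to be rebuilt. As a minimal sketch (not taken from the original project), the payload below moves an alias from an old index to a new one in a single request to the aliases API. The helper name alias_swap_actions and the names photos, photos_v1 and photos_v2 are hypothetical.

```ruby
# Build the body for the Elasticsearch aliases API so that the alias is
# removed from the old index and added to the new one in one atomic request.
# The method name and the index/alias names used below are illustrative only.
def alias_swap_actions(alias_name, old_index, new_index)
  {
    actions: [
      { remove: { index: old_index, alias: alias_name } },
      { add:    { index: new_index, alias: alias_name } }
    ]
  }
end

body = alias_swap_actions('photos', 'photos_v1', 'photos_v2')

# With an elasticsearch-ruby client the body would be sent like this
# (commented out because it needs a running cluster):
#   client.indices.update_aliases body: body
```

Because the application only ever searches through the alias, it does not need to know which concrete index is currently behind it.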

Implementing the search

I use elasticsearch-rails, with simple_query_string as the matcher.

simple_query_string:
  { query: @condition_params[:keyword],
    fields: ['name', 'description'],
    default_operator: 'and'
  }

The resulting query looks like this (for reference).

{"query":{"bool":{
  "must":[
    {"term":{"owner_id":1}},
    {"term":{"type":"T"}},
    {"simple_query_string":{
       "query":"天気","fields":[
         "name","description"
       ],
       "default_operator":"and"
      }
    },
    {"term":{"creator_id":24383}}
  ],
  "must_not":[
    {"term":{"public_flag":0}}
  ],
  "should":[
    {"term":{"permission":"hogehoge"}},
    {"term":{"permission2":"fugafuga"}}
  ]
}},
"size":10,
"sort":[{"id":"desc"}]
}

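I suspect there are many other ways to set this up, though...

Fine-tuning the search results

Setting a minimum score

Because the search sometimes returned results with a low match rate, I added the min_score parameter to the query to exclude hits with low scores.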
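As a rough sketch of where min_score sits, it is a top-level key of the search body, next to query and size. The helper name with_min_score, the keyword and the threshold 0.5 are assumptions for illustration; the original post does not state which value was used.

```ruby
require 'json'

# Return a copy of a search body with min_score set, so that hits scoring
# below the threshold are dropped. The original hash is left untouched.
def with_min_score(search_body, threshold)
  search_body.merge(min_score: threshold)
end

search_body = {
  query: {
    simple_query_string: {
      query: 'weather',               # placeholder keyword
      fields: %w(name description),
      default_operator: 'and'
    }
  },
  size: 10
}

# 0.5 is an arbitrary example threshold; a useful value depends on the data.
puts JSON.pretty_generate(with_min_score(search_body, 0.5))
```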

Pagination

To express pagination in an Elasticsearch query, use from and size.

{"query":{"bool":{
  "must":[
    ・・・・・・・・・
  ],
  "must_not":[
    ・・・・・・・・・
  ],
  "should":[
    ・・・・・・・・・
  ]
}},
"from": 0, 
"size":10,
"sort":[{"id":"desc"}]
}

With elasticsearch-model, this is implemented as follows:

 @client.search(query).offset(params[:page]).limit(params[:offset]).records

That is, it uses the offset and limit methods.
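Since offset corresponds to from and limit to size, a request that carries a page number rather than an offset has to be converted first. The following is a minimal sketch of that conversion, assuming 1-based page numbers; the helper name pagination_params and the default page size of 10 are hypothetical.

```ruby
# Translate a 1-based page number and a page size into Elasticsearch's
# from / size parameters. Invalid or missing input falls back to page 1
# and the default page size. All names here are illustrative.
def pagination_params(page, per_page, default_per_page: 10)
  page     = [page.to_i, 1].max
  per_page = per_page.to_i
  per_page = default_per_page if per_page <= 0

  { from: (page - 1) * per_page, size: per_page }
end

pagination_params(1, 10)    # => { from: 0, size: 10 }
pagination_params(3, 10)    # => { from: 20, size: 10 }
pagination_params(nil, nil) # => { from: 0, size: 10 }
```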

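Testing (tests for each module and for the whole)

I wrote specs for each class with RSpec. The query is built by separate classes for its must, should, must_not, and sort parts, and each class has its own spec to confirm that it generates the correct query. I also changed the specs under spec/requests so that they run against Elasticsearch; as described below, running them locally requires passing a parameter.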
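The split into one class per query part could look roughly like the sketch below. The class names, constructor arguments and field values are hypothetical, and the should part is left out for brevity; the point is only that each class returns a plain structure that a test can compare directly.

```ruby
# Each clause class is responsible for one part of the bool query, so each
# one can be tested in isolation. Class names and parameters are hypothetical.
class MustClause
  def initialize(owner_id:, keyword:)
    @owner_id = owner_id
    @keyword  = keyword
  end

  def to_a
    [
      { term: { owner_id: @owner_id } },
      { simple_query_string: { query: @keyword,
                               fields: %w(name description),
                               default_operator: 'and' } }
    ]
  end
end

class MustNotClause
  def to_a
    [{ term: { public_flag: 0 } }]
  end
end

class SortClause
  def to_a
    [{ id: 'desc' }]
  end
end

# Compose the individual clauses into the final search body.
class PhotoQuery
  def initialize(must:, must_not:, sort:, size: 10)
    @must, @must_not, @sort, @size = must, must_not, sort, size
  end

  def to_h
    {
      query: { bool: { must: @must.to_a, must_not: @must_not.to_a } },
      size: @size,
      sort: @sort.to_a
    }
  end
end

query = PhotoQuery.new(
  must:     MustClause.new(owner_id: 1, keyword: 'weather'),
  must_not: MustNotClause.new,
  sort:     SortClause.new
).to_h

# A per-class check only needs to compare the generated structure.
raise 'unexpected must_not clause' unless MustNotClause.new.to_a == [{ term: { public_flag: 0 } }]
```

In an RSpec suite the last line would become an expectation such as eq on the same structure, one example per class.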

Installing Elasticsearch on Circle CI and running the tests

Elasticsearch is installed in circle.yml so that the tests that use Elasticsearch are run only on Circle CI.

dependencies:
  cache_directories:
    - "~/docker"
  override:
    - bundle check --path=vendor/bundle || bundle install --path=vendor/bundle --jobs=4 --retry=3
    - if [[ ! -e elasticsearch-2.3.5 ]]; then wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-2.3.5.tar.gz && tar -xvf elasticsearch-2.3.5.tar.gz; fi
    - elasticsearch-2.3.5/bin/plugin install analysis-kuromoji
    - elasticsearch-2.3.5/bin/elasticsearch: {background: true}
    - sleep 10 && curl --retry 10 --retry-delay 5 -v http://127.0.0.1:9200/
・・・・・・・・・・・・・
・・・・・・

test:
  override:
    - CI=true RAILS_ENV=test bundle exec rspec spec

I also made a small addition on the RSpec side.

  # Added so that some tests are skipped depending on the environment
  config.filter_run_excluding broken: true unless ENV['CI']


describe PhotoSearch, broken: true do

end

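Backup and restore

Backups are taken manually instead of relying on the automatic backups of Amazon Elasticsearch Service. The point to watch out for is the role configuration and related settings.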

        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:s3:::staging-backup",
                "arn:aws:iam::319807237558:role/EC2StagingApplication"
            ]
        }

"Principal": {
        "Service": [
                 "es.amazonaws.com",
                 ・・・・・・
         ]
  }

I wrote the backup code using the code under elasticsearch-ruby/elasticsearch-api/lib/elasticsearch/api/actions/snapshot/ as a reference.
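A manual snapshot generally has three steps: register an S3 repository, create a snapshot, and restore it when needed. The sketch below is only an outline under assumptions: the helper name s3_repository_body, the bucket, the region, the role ARN, and the repository and snapshot names are all placeholders rather than values from the original project. On Amazon Elasticsearch Service the repository registration also has to be sent as a signed request by credentials that are allowed to pass the role, which is omitted here.

```ruby
# Build the settings used to register an S3 snapshot repository.
# The bucket, region and role ARN passed in below are placeholders.
def s3_repository_body(bucket:, region:, role_arn:)
  {
    type: 's3',
    settings: { bucket: bucket, region: region, role_arn: role_arn }
  }
end

repository_body = s3_repository_body(
  bucket:   'example-backup-bucket',
  region:   'ap-northeast-1',
  role_arn: 'arn:aws:iam::123456789012:role/ExampleSnapshotRole'
)

# With an elasticsearch-ruby client the calls would look roughly like this
# (commented out because they need a reachable, authorized domain):
#   client.snapshot.create_repository repository: 'example-repo', body: repository_body
#   client.snapshot.create  repository: 'example-repo', snapshot: 'snapshot-1'
#   client.snapshot.restore repository: 'example-repo', snapshot: 'snapshot-1'
```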

Results

I think the biggest change came from fixing the data structure: response times that used to be around 10 seconds are now under 4 seconds.