Trying Out Elasticsearch

I had no real idea what Elasticsearch was, so I installed it and tried it out.

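Preparation

First, set up a local environment to run everything.

OS: Windows 11 Home 22H2
Engine: elasticsearch 8.10.2
Tool: kibana 8.10.2
Language: C#

Download elasticsearch from the following site and extract it to a suitable folder.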
https://www.elastic.co/jp/downloads/elasticsearch

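Next, download kibana from the following site and extract it to any folder.
Kibana is described as a tool that works together with Elasticsearch to explore, visualize, and analyze data.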
https://www.elastic.co/jp/downloads/kibana

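Because this local environment is only for experimenting, I set security to false in the config file (config\elasticsearch.yml). Elasticsearch also hit an OutOfMemory error at startup, so I changed the heap settings in config\jvm.options.

config\elasticsearch.yml
xpack.security.enabled: false

config\jvm.options
-Xms4g
-Xmx4g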

Also, open a command prompt, move to the elasticsearch-8.10.2\bin directory, and run the following commands to install the plugins.

elasticsearch-plugin install analysis-kuromoji
elasticsearch-plugin install analysis-icu

In a command prompt, move to the elasticsearch-8.10.2\bin directory and start Elasticsearch.

elasticsearch.bat

Open a browser and go to the following URL. If information about the node and cluster is displayed, Elasticsearch is running.

http://localhost:9200/
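
Next, open another command prompt, move to the bin directory of the extracted kibana folder, and start Kibana.

kibana.bat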

Open a browser and go to the following URL. If the page loads normally, Kibana is running.

http://localhost:5601/

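Trying NGram

Running the following C# program creates the index my-index, whose settings define a tokenizer that uses NGram. The project file is listed first, followed by the program.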

<Project Sdk="Microsoft.NET.Sdk">
    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net7.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>
    <ItemGroup>
        <PackageReference Include="Elastic.Clients.Elasticsearch" Version="8.10.0" />
        <PackageReference Include="System.Text.Encoding.CodePages" Version="7.0.0" />
    </ItemGroup>
</Project>

using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);

var isa = new IndexSettingsAnalysis();

// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
    MinGram = 1,
    MaxGram = 10
});

// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_ngram_tokenizer"
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Index(
            new IndexSettings()
            {
                Analysis = isa,
                MaxNgramDiff = 10
            })
    )
);

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));
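
To check the result, run the following command in Kibana's Dev Tools and look at the index settings.

GET /my-index/_settings

Next, tokenize "ロキソプロフェン錠60mg" with custom_ngram_analyzer.
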
GET /my-index/_analyze
{
  "analyzer": "custom_ngram_analyzer",
  "text": "ロキソプロフェン錠60mg"
}
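
To make the output of the request above easier to read, here is a minimal sketch of how character n-grams are generated. It is my own illustration, not Elasticsearch's code: NGrams is a hypothetical helper, and I use min 1 / max 2 on a short string only to keep the output small (the index above uses 1 and 10).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only (not part of Elasticsearch or its client).
// For every start position, emit the substrings of length minGram..maxGram.
static List<string> NGrams(string text, int minGram, int maxGram)
{
    var result = new List<string>();
    for (int start = 0; start < text.Length; start++)
    {
        for (int len = minGram; len <= maxGram && start + len <= text.Length; len++)
        {
            result.Add(text.Substring(start, len));
        }
    }
    return result;
}

// Prints: 錠 | 錠6 | 6 | 60 | 0 | 0m | m | mg | g
Console.WriteLine(string.Join(" | ", NGrams("錠60mg", 1, 2)));
```

With MinGram = 1 and MaxGram = 10 the same idea applies, which is why the number of tokens grows quickly with the length of the text.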

Trying kuromoji

Next, I try the kuromoji tokenizer, which supports Japanese. Running the following C# program creates custom_kuromoji_analyzer.

using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);

var isa = new IndexSettingsAnalysis();

// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
    MinGram = 1,
    MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
    Mode = KuromojiTokenizationMode.Search,
    DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_ngram_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_kuromoji_tokenizer",
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Index(
            new IndexSettings()
            {
                Analysis = isa,
                MaxNgramDiff = 10
            })
    )
);

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

Tokenize "ロキソプロフェン錠60mg" as before, this time with custom_kuromoji_analyzer.
Run the following command in Kibana's Dev Tools.

GET /my-index/_analyze
{
  "analyzer": "custom_kuromoji_analyzer",
  "text": "ロキソプロフェン錠60mg"
}

Converting text to kana with kuromoji

Next, I use kuromoji's TokenFilter to convert the tokens to their kana readings. Running the following C# program creates custom_kuromoji_yomi_analyzer.

using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);

var isa = new IndexSettingsAnalysis();

// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
    MinGram = 1,
    MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
    Mode = KuromojiTokenizationMode.Search,
    DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_ngram_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_kuromoji_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_kuromoji_tokenizer",
    Filter = new List<string>{ "kuromoji_readingform" }
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Index(
            new IndexSettings()
            {
                Analysis = isa,
                MaxNgramDiff = 10
            })
    )
);

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

As before, run the following command in Kibana's Dev Tools to tokenize "ロキソプロフェン錠60mg" with custom_kuromoji_yomi_analyzer.

GET /my-index/_analyze
{
  "analyzer": "custom_kuromoji_yomi_analyzer",
  "text": "ロキソプロフェン錠60mg"
}
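
Removing unnecessary characters

Next, I use a CharFilter to remove unnecessary characters. The pattern used here strips digits and the string mg before tokenization.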

using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);

var isa = new IndexSettingsAnalysis();

// Charfilters
isa.CharFilters = new CharFilters();
isa.CharFilters.Add("custom_char_filter", new PatternReplaceCharFilter()
{
    Pattern = "[0-9]|mg",
    Replacement = ""
});
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
    MinGram = 1,
    MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
    Mode = KuromojiTokenizationMode.Search,
    DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_ngram_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
    Tokenizer = "custom_kuromoji_tokenizer"
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_kuromoji_tokenizer",
    Filter = new List<string>{ "kuromoji_readingform" }
});
// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Index(
            new IndexSettings()
            {
                Analysis = isa,
                MaxNgramDiff = 10
            })
    )
);

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

As before, run the following command in Kibana's Dev Tools to tokenize "ロキソプロフェン錠60mg" with custom_kuromoji_yomi_analyzer.

GET /my-index/_analyze
{
  "analyzer": "custom_kuromoji_yomi_analyzer",
  "text": "ロキソプロフェン錠60mg"
}
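
The effect of custom_char_filter can be previewed outside Elasticsearch with the same pattern and replacement. This is only a sketch: it uses .NET's Regex rather than the Java regex engine Elasticsearch uses (for a pattern this simple I would expect the same result), and StripDoseText is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: applies the same pattern and replacement as custom_char_filter.
static string StripDoseText(string input) => Regex.Replace(input, "[0-9]|mg", "");

// Prints: ロキソプロフェン錠
Console.WriteLine(StripDoseText("ロキソプロフェン錠60mg"));

// [0-9] and mg cover ASCII characters only, so full-width text passes through unchanged.
// Prints: ロキソプロフェン錠60mg
Console.WriteLine(StripDoseText("ロキソプロフェン錠60mg"));
```

If the data contains full-width digits or units, the pattern would need to be extended to cover them.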
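
Next, add an EdgeNGram TokenFilter and append it to custom_kuromoji_yomi_analyzer, so that each reading is also indexed by its prefixes.
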
isa.TokenFilters.Add("custom_edge_ngram_filter", new EdgeNGramTokenFilter()
{
    MinGram = 1,
    MaxGram = 10
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_kuromoji_tokenizer",
    Filter = new List<string>{ "kuromoji_readingform", "custom_edge_ngram_filter" }
});
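
As a rough picture of what the edge n-gram filter adds, the sketch below lists the prefixes of a single token. EdgeNGrams is a hypothetical helper, and the input is just a sample string standing in for one token; how kuromoji actually splits the text is not modelled here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only: emits the prefixes of one token
// with lengths from minGram up to maxGram (or the token length, if shorter).
static List<string> EdgeNGrams(string token, int minGram, int maxGram)
{
    var prefixes = new List<string>();
    for (int len = minGram; len <= maxGram && len <= token.Length; len++)
    {
        prefixes.Add(token.Substring(0, len));
    }
    return prefixes;
}

// Prints: ロ | ロキ | ロキソ
Console.WriteLine(string.Join(" | ", EdgeNGrams("ロキソ", 1, 10)));
```

Unlike the NGram tokenizer, only substrings anchored at the start of the token are produced, so this suits prefix-style matching on the reading.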

The char filter is also applied to the other two analyzers.

isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_ngram_tokenizer",
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_kuromoji_tokenizer",
});

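Creating an index and searching it

That makes three analyzers. Finally, I build a drug search index using custom_ngram_analyzer and custom_kuromoji_yomi_analyzer.

For the drug master, download the full file (ZIP: 898 KB) from the following site.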
https://www.ssk.or.jp/smph/seikyushiharai/tensuhyo/kihonmasta/kihonmasta_04.html

The following program maps the fields to custom_ngram_analyzer and custom_kuromoji_yomi_analyzer and builds the index.
Change the CSV path to match your own environment.

using System.Text;
using Elastic.Clients.Elasticsearch;
using Elastic.Clients.Elasticsearch.Analysis;
using Elastic.Clients.Elasticsearch.IndexManagement;
using Elastic.Clients.Elasticsearch.Mapping;

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var settings = new ElasticsearchClientSettings(new Uri("http://localhost:9200"));
var client = new ElasticsearchClient(settings);

var isa = new IndexSettingsAnalysis();

// Charfilters
isa.CharFilters = new CharFilters();
isa.CharFilters.Add("custom_char_filter", new PatternReplaceCharFilter()
{
    Pattern = "[0-9]|mg",
    Replacement = ""
});
// Tokenizers
isa.Tokenizers = new Tokenizers();
isa.Tokenizers.Add("custom_ngram_tokenizer", new NGramTokenizer()
{
    MinGram = 1,
    MaxGram = 10
});
isa.Tokenizers.Add("custom_kuromoji_tokenizer", new KuromojiTokenizer()
{
    Mode = KuromojiTokenizationMode.Search,
    DiscardPunctuation = true
});
// TokenFilters
isa.TokenFilters = new TokenFilters();
isa.TokenFilters.Add("custom_edge_ngram_filter", new EdgeNGramTokenFilter()
{
    MinGram = 1,
    MaxGram = 10
});
// Analyzers
isa.Analyzers = new Analyzers();
isa.Analyzers.Add("custom_ngram_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_ngram_tokenizer",
});
isa.Analyzers.Add("custom_kuromoji_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_kuromoji_tokenizer",
});
isa.Analyzers.Add("custom_kuromoji_yomi_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "custom_kuromoji_tokenizer",
    Filter = new List<string>{ "kuromoji_readingform", "custom_edge_ngram_filter" }
});
isa.Analyzers.Add("keyword_search_analyzer", new CustomAnalyzer()
{
    CharFilter = new List<string>{ "custom_char_filter" },
    Tokenizer = "whitespace"
});

// delete index & create index
var delRes = await client.Indices.DeleteAsync("my-index");
var creRes = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Index(
            new IndexSettings()
            {
                Analysis = isa,
                MaxNgramDiff = 10
            })
    )
    .Mappings(map => map
        .Properties(
            new Properties<YakName>()
            {
                { "yakNameNgram", new TextProperty()
                    {
                        Analyzer = "custom_ngram_analyzer",
                        SearchAnalyzer = "keyword_search_analyzer"
                    }
                },
                { "yakNameKana", new TextProperty()
                    {
                        Analyzer = "custom_kuromoji_yomi_analyzer",
                        SearchAnalyzer = "keyword_search_analyzer"
                    }
                }
            }
        )
    )
);

using (StreamReader reader = new StreamReader(@"C:\es\y_ALL20230920.csv", Encoding.GetEncoding("Shift_JIS")))
{
    while (!reader.EndOfStream)
    {
        string? line = reader.ReadLine();
        if (line is null) break;
        string[] values = line.Split(',');
        if (values.Length <= 4) continue; // skip rows without a name column
        var yak_name = values[4].Trim(new char[] { '"' });
        if (String.IsNullOrEmpty(yak_name))
        {
            continue;
        }
        var yakName = new YakName
        {
            YakNameNgram = yak_name,
            YakNameKana = yak_name
        };
        var indexRes = await client.IndexAsync(yakName, "my-index");
    }
}

Console.WriteLine(DateTime.Now.ToString("yyyy/MM/dd HH:mm:ss.fff"));

public class YakName
{
    public string YakNameNgram { get; set; } = "";
    public string YakNameKana { get; set; } = "";
}
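
One thing to keep in mind: the program splits each line with Split(','), which assumes that no field contains a comma inside its quotes. I have not checked whether the master file ever does. If it does, a quote-aware split along the following lines could be used instead. This is only a sketch; SplitCsvLine is a hypothetical helper and the sample line is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits one CSV line, honoring double-quoted fields
// and the "" escape for a literal quote inside a quoted field.
static List<string> SplitCsvLine(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;
    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c != '"') { current.Append(c); }
            else if (i + 1 < line.Length && line[i + 1] == '"') { current.Append('"'); i++; }
            else { inQuotes = false; }
        }
        else if (c == '"') { inQuotes = true; }
        else if (c == ',') { fields.Add(current.ToString()); current.Clear(); }
        else { current.Append(c); }
    }
    fields.Add(current.ToString());
    return fields;
}

// Made-up sample line; the fifth field contains a comma inside quotes.
// Prints: sample, name
Console.WriteLine(SplitCsvLine("0,\"Y\",\"1\",2,\"sample, name\"")[4]);
```

The quotes are removed during the split, so the later Trim call on the name would no longer be needed with this approach.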

I expect building the index to take a few minutes.
Once it finishes, run the following commands in Kibana's Dev Tools to search for drugs.

GET /my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "yakNameNgram": "恵美須" } }
      ]
    }
  },
  "sort" : [{"_score":"desc"}]
}
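
To get a feel for why yakNameNgram is indexed with custom_ngram_analyzer but searched with keyword_search_analyzer, here is a rough simulation in plain C#. It is a simplified sketch under my own assumptions: it ignores scoring entirely, all helper names are hypothetical, and the drug names are only sample strings. The index side stores every n-gram (1 to 10 characters) of the filtered name, while the query side is only split on whitespace, so a hit means that a whole query word is one of the stored n-grams.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as custom_char_filter.
static string ApplyCharFilter(string s) => Regex.Replace(s, "[0-9]|mg", "");

// Index side: all n-grams of the filtered name (mirrors custom_ngram_analyzer).
static HashSet<string> IndexTokens(string name, int minGram, int maxGram)
{
    string filtered = ApplyCharFilter(name);
    var tokens = new HashSet<string>();
    for (int start = 0; start < filtered.Length; start++)
        for (int len = minGram; len <= maxGram && start + len <= filtered.Length; len++)
            tokens.Add(filtered.Substring(start, len));
    return tokens;
}

// Query side: filtered text split on whitespace (a simplification of keyword_search_analyzer).
static string[] QueryTokens(string query) =>
    ApplyCharFilter(query).Split(new[] { ' ', '\t', '\u3000' }, StringSplitOptions.RemoveEmptyEntries);

// A match query uses OR by default, so one matching query word is enough for a hit.
static bool Matches(string name, string query)
{
    HashSet<string> indexed = IndexTokens(name, 1, 10);
    return QueryTokens(query).Any(t => indexed.Contains(t));
}

Console.WriteLine(Matches("ロキソプロフェン錠60mg", "プロフェン"));   // True: substring of the name
Console.WriteLine(Matches("ロキソプロフェン錠60mg", "アスピリン"));   // False: not contained
// False: the query word has 13 characters, longer than MaxGram = 10.
Console.WriteLine(Matches("ロキソプロフェンナトリウム錠60mg", "ロキソプロフェンナトリウム"));
```

The last line shows a side effect of this design: a single query word longer than MaxGram cannot match, even when it appears in the name.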

Searching by reading works the same way against yakNameKana.

GET /my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "yakNameKana": "ショウサンギン" } }
      ]
    }
  },
  "sort" : [{"_score":"desc"}]
}